I am trying to implement a simple system with a single master node and multiple backup nodes, to learn about distributed and fault-tolerant architecture.
This is what my system currently looks like:
1. There are N nodes, all identical. One of them is the master and runs a simple web server.
2. All nodes communicate with each other using a simple heartbeat protocol, and each maintains the global state (number of nodes available, who the master is, and the uptime and downtime of every other node).
3. If any node does not hear from the master for a set time, it raises an alarm. If the nodes reach a consensus that the master is down, a new master is elected.
4. If the network of nodes gets partitioned:
   - If the master is in the minority partition, it stops serving requests and shuts itself down after a set period of time. The minority group cannot elect a master (a minimum number of nodes is required to make a decision).
   - A new master is elected in the majority partition after a set time of not hearing from the old master.
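In case it helps, the rules above boil down to roughly the following per-node logic. This is a simplified Python sketch, not the actual code: the names (`Node`, `tick`, `heartbeat_timeout`, and so on) are made up for illustration, and it assumes that "minimum number of nodes" means a strict majority of the N configured nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    node_id: str
    cluster_size: int            # N: total number of configured nodes (assumed fixed)
    heartbeat_timeout: float     # seconds of silence before a peer counts as down
    master_id: Optional[str] = None
    last_seen: Dict[str, float] = field(default_factory=dict)  # peer id -> last heartbeat time

    @property
    def is_master(self) -> bool:
        return self.master_id == self.node_id

    def on_heartbeat(self, peer_id: str, now: float) -> None:
        # Record when we last heard from this peer.
        self.last_seen[peer_id] = now

    def is_alive(self, peer_id: str, now: float) -> bool:
        last = self.last_seen.get(peer_id)
        return last is not None and now - last <= self.heartbeat_timeout

    def has_quorum(self, now: float) -> bool:
        # Count this node plus every peer heard from recently.
        # Assumption: quorum is a strict majority of the configured cluster.
        reachable = 1 + sum(1 for p in self.last_seen if self.is_alive(p, now))
        return reachable > self.cluster_size // 2

    def tick(self, now: float) -> str:
        """Run periodically; returns the action this node should take."""
        if self.is_master:
            if not self.has_quorum(now):
                # Master ended up in the minority partition: stop serving.
                self.master_id = None
                return "step_down"
            return "none"
        master_down = self.master_id is None or not self.is_alive(self.master_id, now)
        if master_down and self.has_quorum(now):
            # Majority side, master silent: allowed to elect a new one.
            return "start_election"
        return "none"
```

For example, with 5 nodes and a 3-second timeout, a master that last heard from only two peers at t=0 returns `"step_down"` from `tick(10.0)`, while a backup that still hears from two other backups returns `"start_election"`.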
Now I am stuck on a problem: in step 4 above, there is a time gap during which the old master is still serving requests while a new master is being elected in the majority partition.
It seems this could cause inconsistent data across the system if a client writes new data to the old master. How can I avoid this issue? I would be glad if someone could point me in the right direction.
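To put numbers on the gap, here is a small Python sketch of the timeline. The three parameters are hypothetical stand-ins for the "set times" in my description, measured in seconds from the moment the partition happens, and the sketch assumes both sides measure elapsed time at the same rate.

```python
from typing import Tuple


def risky_windows(step_down_after: float,
                  detect_after: float,
                  election_time: float) -> Tuple[float, float]:
    """Return (stale_write_window, dual_master_window) in seconds.

    step_down_after: how long the isolated old master keeps serving.
    detect_after:    how long the majority waits before declaring the master down.
    election_time:   how long the majority then takes to elect a new master.
    All three names are illustrative assumptions, not part of any real API.
    """
    old_master_stops = step_down_after
    new_master_starts = detect_after + election_time

    # The old master may accept writes the majority never sees
    # for as long as it keeps serving after the partition.
    stale_write_window = old_master_stops

    # Both masters serve at once only if the old one outlives the election.
    dual_master_window = max(0.0, old_master_stops - new_master_starts)
    return stale_write_window, dual_master_window
```

With `risky_windows(10.0, 4.0, 2.0)` the old master accepts writes for 10 seconds and overlaps with the new master for 4 of them. Making the step-down time shorter than the election delay, as in `risky_windows(3.0, 4.0, 2.0)`, removes the overlap in this simplified model, but the old master can still take writes for 3 seconds, which is the part I do not know how to handle.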