
There is a database as the data store and y (> 5) other machines. There is a machine A whose data is updated every x minutes. The y machines get the data from machine A every x minutes and update it in the database. Every machine doing the same work is meant to provide some fault tolerance. Is there a clean way to model this setup with fault tolerance?

Any pointers are appreciated.

Sam

1 Answer


This is a problem with very large scope. How is the data structured? How do the "db loaders" get the data from the "data producing" machine? What happens if an update fails: is the data lost, or must it be persisted at any cost?

I will make some assumptions and suggest a solution:

1. The data can be partitioned.
2. You have access to a central persistent buffer, e.g. MSMQ or WebSphere MQ.

The machine generating the data puts chunks onto a central queue. Each chunk is composed of a set of record IDs and the new values for the relevant properties; you decide the granularity. The "db loaders" listen to the queue, and each dequeues a chunk (the contention is only at the dequeue stage and is very optimized) and updates its own set of IDs. This way the insert work is distributed among the machines, each handles its own portion, and if one crashes, well, the others simply work a bit harder.

In case of a failure to update, you can return the chunk to the queue and retry later (transactional read).
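
To make the flow concrete, here is a minimal sketch of the pattern, using Python's in-memory `queue.Queue` as a stand-in for the central broker (MSMQ / WebSphere MQ). The chunk size, loader count, and `update_database()` are hypothetical placeholders, not part of the original design:

```python
# Minimal sketch: in-memory queue stands in for the central broker.
import queue
import threading

chunk_queue = queue.Queue()

def producer(pairs, chunk_size=100):
    """Split the producer's records into chunks and put them on the central queue."""
    for i in range(0, len(pairs), chunk_size):
        chunk_queue.put(pairs[i:i + chunk_size])

def update_database(chunk):
    """Placeholder for the real write, e.g. a batched upsert into SQL Server."""
    pass

def db_loader():
    """Each loader dequeues a chunk and updates its own set of records;
    on failure the chunk goes back to the queue, mimicking a transactional read."""
    while True:
        try:
            chunk = chunk_queue.get(timeout=1)
        except queue.Empty:
            return  # no more work
        try:
            update_database(chunk)
        except Exception:
            chunk_queue.put(chunk)  # return the chunk so another loader can retry
        finally:
            chunk_queue.task_done()

# Start y loaders; if one crashes, the rest simply drain the queue a bit harder.
producer([("type1-value", "type2-value")] * 1000)
loaders = [threading.Thread(target=db_loader) for _ in range(5)]
for t in loaders:
    t.start()
for t in loaders:
    t.join()
```

The important property is that the only shared operation is the dequeue itself, so adding or losing a loader does not require any coordination beyond the queue.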

Vitaliy
  • Thanks. The data-producing machine doesn't have any record IDs. Let's say the data is just a (type1, type2) pair without duplicates, and the database is SQL Server. The data from the data-producing machine cannot be dequeued. The db loaders get a record ID, which leads to the possibility of duplicate records in the db – Sam Aug 02 '12 at 20:18
  • Where do the db loaders get the IDs from? Why can't you use a queue? Note: it does not have to be a persistent queue as I described; it can be an in-memory queue inside the producer that is exposed to the loaders by a service. – Vitaliy Aug 02 '12 at 20:57
  • I do not have control over the data-producing machine. The db loaders generate a new ID, say a GUID. – Sam Aug 03 '12 at 03:20
  • No problem. You can have an intermediate process that gets the data from the producer and chunks it up for the db loaders. – Vitaliy Aug 03 '12 at 06:25
  • The intermediate process becomes a single point of failure, right? – Sam Aug 03 '12 at 08:11
  • Not necessarily. You can have two processes deployed on two different servers in an active-passive configuration (see the sketch after this thread). By the way, when requesting the data from the producer, can you request only part of it? Moreover, this process is very simple, so it will be easy to achieve almost 100% availability. The main questions are: what are the consistency and availability requirements? Can you afford to lose data once in a while? – Vitaliy Aug 03 '12 at 11:00
  • Thanks a lot! Could you explain the active-passive configuration please? Also, I cannot consume only part of the data; I want to use all the machines for fault tolerance. And I cannot afford to lose data. – Sam Aug 03 '12 at 11:08
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/14862/discussion-between-vitaliy-and-sam) – Vitaliy Aug 03 '12 at 15:48
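
On the active-passive point raised in the comments, one common way to implement it is a lease held in the shared database: whichever intermediary process currently holds the lease does the work, and the standby takes over when the lease expires. This is only a sketch under that assumption; SQLite stands in for SQL Server, and the table name, column names, and 30-second lease are made up for illustration:

```python
# Rough sketch of active-passive failover via a single-row lease table.
import sqlite3
import time
import uuid

NODE_ID = str(uuid.uuid4())   # identity of this intermediary process
LEASE_SECONDS = 30            # illustrative lease duration

def init(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lease "
        "(id INTEGER PRIMARY KEY CHECK (id = 1), owner TEXT, expires_at REAL)"
    )
    conn.execute("INSERT OR IGNORE INTO lease (id, owner, expires_at) VALUES (1, '', 0)")
    conn.commit()

def try_acquire_lease(conn):
    """Become (or stay) the active node if the lease is free, expired, or already ours."""
    now = time.time()
    cur = conn.execute(
        "UPDATE lease SET owner = ?, expires_at = ? WHERE expires_at < ? OR owner = ?",
        (NODE_ID, now + LEASE_SECONDS, now, NODE_ID),
    )
    conn.commit()
    return cur.rowcount == 1

def run_intermediary(conn):
    """Active node pulls from the producer and enqueues chunks; passive node just waits."""
    while True:
        if try_acquire_lease(conn):
            pass  # pull data from the producer and enqueue chunks here
        time.sleep(LEASE_SECONDS / 3)  # renew well before the lease expires
```

Both intermediary processes run the same loop; only the lease holder actually touches the producer, so losing either server just shifts the work to the other after at most one lease period.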