
Let's say I have an internal system of 20+ nodes that pass data back and forth to each other through sockets, where low latency is a top priority. How do I design it so that if a random server (or servers) goes down, I can recover/resend the data that was already sent but not yet processed by the downed server?

For example, suppose A is streaming data to B, but at some point B goes down without processing some of the data. If we assume A can detect that B went down and reroute the data to C, how would I design the system so that I know what data was sent to B and should now be rerouted to C?

I'm assuming I'll have to rely on one of the various message queue products out there, but I'm wondering if there is also an easier way to do this!
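One common way to know "what was sent to B but never processed" is to keep every message in an in-memory send buffer, keyed by a sequence number, until the receiver acknowledges it. Whatever is still in the buffer when B dies is exactly what must be rerouted to C. Below is a minimal sketch of that idea; the class and method names (`ReliableSender`, `reroute`, the `transport` callable standing in for a socket write) are illustrative, not from any particular library:

```python
import itertools

class ReliableSender:
    """Track each outgoing message until the peer acknowledges that it
    has *processed* (not merely received) it. Anything unacknowledged
    when the peer dies can be replayed to a backup node."""

    def __init__(self):
        self._seq = itertools.count()
        self._unacked = {}  # seq -> payload; dicts preserve insertion order

    def send(self, payload, transport):
        # transport is a placeholder for the real socket write to B.
        seq = next(self._seq)
        self._unacked[seq] = payload
        transport(seq, payload)
        return seq

    def on_ack(self, seq):
        # B confirms it has processed `seq`; safe to forget the payload.
        self._unacked.pop(seq, None)

    def reroute(self, transport):
        # B is down: replay everything B never acknowledged, in order,
        # to the backup transport (the socket to C).
        for seq, payload in sorted(self._unacked.items()):
            transport(seq, payload)
```

The key design choice is that acknowledgements must mean "processed", not "received by the kernel's TCP stack"; TCP's own ACKs only tell you the bytes reached B's socket buffer, not that B's application did anything with them.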

Albert Lim
  • I think you're going to have to give some more specifics. In general, reliable network protocols will have A buffer the data until an acknowledgement is received from the intended recipient. If no ACK is received within a suitable timeout, then A can retry or do something else such as "reroute to C" or give up. But your mentioning "message queues" sounds like you aren't trying to do this at the socket level or aren't sure. – cklin Apr 04 '14 at 04:58
  • @cklin Thank you, yes I'm actually unsure whether to do it at the socket level or not. Basically I'm sort of searching in the dark here because of my inexperience. – Albert Lim Apr 04 '14 at 20:06
  • @cklin For example, I know a tech company that uses direct sockets to communicate, and am figuring out how they do fault tolerance. Are direct sockets that much 'faster' than message queues? Do they have some custom in-house fault tolerance socket level code? Or is that just written into the protocol, as you mentioned. – Albert Lim Apr 04 '14 at 20:21
  • @cklin Sorry for the rambling, but "If no ACK is received within a suitable timeout, then A can retry or do something else such as "reroute to C" or give up" - Is this basically fault tolerance? And how easy would it be to implement? Or would you recommend going the already implemented message queue route, giving up whatever latency advantage there is with sockets, if any? – Albert Lim Apr 04 '14 at 20:54
  • It feels like you should gain a better fundamental understanding of networks and distributed systems before you tackle this head-on. You may wish to quickly go through the material from an introductory computer networks course on Coursera (e.g., David Weatherall's course). The scope of your questions is really beyond a post on SO. – cklin Apr 04 '14 at 21:16
  • @cklin I actually did take classes a long time ago: TCP/UDP, multicast, 3-way handshakes, 7-layer stacks, the whole nine yards. Unfortunately they didn't talk about production deployment socket/message queue fault tolerance/latency/durability tradeoffs, nor can I find a good source. Thank you for the suggestion though, I'll check out David Weatherall. – Albert Lim Apr 04 '14 at 23:03

0 Answers