
Current Situation: I currently have dozens of sites that send HTML form data to a collection server. The collection server later forwards the data on to a processing server. Having the processing server go down is not a big deal, but losing form data means losing my job.

Goal: I want to ensure there is no single point of failure that would stop HTML form data from being collected.

Possible Solution: My thought was to have 3 collection servers and to send the HTML form data from the websites to each of them. I would then want some way to ensure that only one copy of each lead is passed from the collection servers on to the processing server.

Users fill form data     It is captured redundantly     And processed here
website01           ->   collectionServer01        ->   processingServer
website06                collectionServer02
website24                collectionServer03
website#N
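
To make the idea concrete, here is a rough sketch of the fan-out I have in mind (the URLs and field names are made up; each site would do the equivalent when its form is submitted):

```
import uuid

import requests

# Hypothetical collection endpoints -- in reality, three servers on
# separate ISPs / data centers.
COLLECTORS = [
    "https://collector01.example.com/lead",
    "https://collector02.example.com/lead",
    "https://collector03.example.com/lead",
]

def submit_lead(form_data):
    """Send the same lead to every collector, tagged with one shared ID."""
    payload = dict(form_data, lead_id=str(uuid.uuid4()))
    accepted = 0
    for url in COLLECTORS:
        try:
            if requests.post(url, json=payload, timeout=5).ok:
                accepted += 1
        except requests.RequestException:
            pass  # this collector is down; the others still get the lead
    # The lead is safe as long as at least one collector stored it.
    return accepted > 0
```

The shared lead_id is what I imagine the collection servers would use to recognize the three copies as the same lead.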

I think this is called a distributed queue??

Question: Assuming it is a distributed queue I am describing, is that a good way to meet my goal? Are there other approaches people have used? How would you recommend ensuring that only one copy gets sent from the collection servers to the processing server?

Deduplicator
zortacon
    Not my downvote, but I have absolutely zero idea what you are talking about. Can you clarify what you mean by "web form leads"? What exactly is happening and how is that causing a problem? – Pekka Jun 09 '12 at 01:11
  • @Pekka, come on now, there's a form on a page that submits sales leads ("web form leads") to a server, which records and then distributes those leads to another server(s?). He's concerned about the first (processing) server going down, resulting in the loss of possible new business/sales. – Jared Farrish Jun 09 '12 at 01:18
    @Pekka, come on now, there's a form on a page that submits sales leads ("web form leads") to a server, which records and then distributes those leads to another server(s?). He's concerned about the first (processing) server going down, resulting in the loss of possible new business/sales. – Jared Farrish Jun 09 '12 at 01:18
  • I don't know that the exact definition of "web form leads" really matters here, @Pekka (though if you thought it was a poor question, why *didn't* you downvote?) I think we're looking at a queueing system, accepting data from a source/interactive user, caching it in a local database, and forwarding it to an (apparently unreliable) sink. Delays are acceptable, but the system must continue to attempt to forward until the sink acknowledges receipt. Sound right, zortacon? – Michael Petrotta Jun 09 '12 at 01:19
  • @Jared, Michal - now that you guys put it that way, it makes sense. Yeah, Webmasters it is then - or maybe Serverfault? – Pekka Jun 09 '12 at 01:25
  • @Pekka - I was thinking http://webmasters.stackexchange.com, didn't think of http://serverfault.com. Seems "hardwary on the heavy server stuff" to me. Y'know, those guys. – Jared Farrish Jun 09 '12 at 01:27
  • yes, delays are fine, but the form information must be collected. – zortacon Jun 09 '12 at 01:33
  • @Michael I don't like to downvote when it's clear that what the OP wants makes sense, and it's just a question of putting it differently. Other than that, I do my share of downvoting, as you can see in my profile :) – Pekka Jun 09 '12 at 10:49
  • @Michael although arguably not as many as you. Wow, that is an impressive upvote:downvote ratio! We need more of this. Seriously. – Pekka Jun 09 '12 at 12:31
  • Thanks, @Pekka, but I'm trying to change that ratio now. Playing janitor has soured me on SO, I think - I see a lot more crap than I do pearls. I want to actively look for more good content, for my own sake. By the way, I was in part teasing you with my comment - I didn't downvote this question myself either. Actually, heck, an upvote for triggering a great answer from Eric. – Michael Petrotta Jun 09 '12 at 19:50
  • By the way, @Pekka, I appreciate the tone of moderation and don't-stomp-quite-so-hard-on-the-newbies you bring to Meta. If you chose to run for mod, I'd vote for you. – Michael Petrotta Jun 09 '12 at 19:53
  • @Michael thanks! I won't run for reasons of time and because I feel the same as you about cleaning up crap, but it would be cool to get some new mods who have a solid mix of niceness and a heavy hand towards bad questions. The list is looking pretty good, I'm hopeful. – Pekka Jun 09 '12 at 19:57

1 Answer


If I understand your question correctly, you have something like this

Some Website         \
Another Website       -->   Intake Server    -->   Processing Server
Yet Another Website  /      (reliable)             (unreliable)

(Customer?) leads flow from many different websites to your Intake Server, and then are forwarded along to the Processing Server. You are concerned about your Intake Server going down, because that is what you are responsible for keeping up.

The classic solution to this problem is to have 2 or more Intake Servers behind a load balancer, and to have a Master and at least one Slave database.
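
As a rough sketch of the Intake Server side (Flask and SQLite here purely as stand-ins for whatever web stack and replicated database you actually run), the property that matters is that the lead is committed to the database before the website gets a success response:

```
import json
import sqlite3
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "leads.db"  # stand-in for the replicated Master database

def init_db():
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute(
            """CREATE TABLE IF NOT EXISTS leads (
                   lead_id   TEXT PRIMARY KEY,
                   payload   TEXT NOT NULL,
                   forwarded INTEGER NOT NULL DEFAULT 0
               )"""
        )

@app.route("/lead", methods=["POST"])
def accept_lead():
    lead = request.get_json(force=True)
    lead_id = lead.get("lead_id") or str(uuid.uuid4())
    with sqlite3.connect(DB_PATH) as conn:
        # INSERT OR IGNORE makes a re-submitted lead harmless (stored once).
        conn.execute(
            "INSERT OR IGNORE INTO leads (lead_id, payload) VALUES (?, ?)",
            (lead_id, json.dumps(lead)),
        )
    # Only after the write has committed do we tell the website "got it".
    return jsonify({"status": "stored", "lead_id": lead_id}), 200

if __name__ == "__main__":
    init_db()
    app.run(port=8080)
```

A separate worker would then pick up rows with forwarded = 0 and push them to the Processing Server, retrying until it acknowledges; that is what turns Processing Server downtime into a delay rather than data loss.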

To avoid the risk of losing your service when an entire data center goes down (remember the tsunami in Japan?), run your setup in multiple data centers and use geographic load balancing to send traffic to the nearest data center or, if that one fails, to one of the others.

In that case, you would want to replicate all data between the various data centers (e.g. Master/Master database, with local slaves for redundancy, or Master in Data Center A plus Slave in Data Center A plus Slave of Master A in Data Center B, etc.).

I have used that arrangement successfully on several occasions. There are services that manage geo load balancing very reliably (though they are not exactly cheap).

If an Intake Server goes down, the load balancer detects this condition and routes traffic to the remaining Intake Servers. If the Master database goes down, you switch to the Slave database and recover the Master.

There is plenty of general information available on load balancing; personally, I have had great experience using both Nginx and HAProxy as load balancers.

If you send all data to all data centers, coordinating which data center sends which lead to the Processing Server is very non-trivial once you consider that you may lose one or more data centers (how do you know which leads a data center sent before it went down? How do you decide which data center should send which lead?). Even with one "Master" data center and one "Hot Stand-By" data center, it is not trivial for the "Hot Stand-By" to know where to take up work if the "Master" goes down, unless they constantly sync state as they would with, e.g., a replicated database solution.
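
If you do go down that road, the usual way to keep it tractable is to make the hand-off idempotent rather than trying to coordinate exactly-once sending: every lead carries a unique ID from the moment it is captured, any data center is allowed to forward it, and the Processing Server keeps only the first copy it sees. A minimal sketch of the receiving side (table and column names are made up):

```
import sqlite3

conn = sqlite3.connect("processed_leads.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS processed_leads (lead_id TEXT PRIMARY KEY, payload TEXT)"
)

def record_lead(lead_id, payload):
    """Store a lead once; returns False if this lead_id was already seen.

    Because lead_id is the primary key, the same lead arriving from two
    data centers (or re-sent after a retry) ends up stored exactly once.
    """
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_leads (lead_id, payload) VALUES (?, ?)",
            (lead_id, payload),
        )
    return cur.rowcount == 1
```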

One of the commenters mentioned (a few times) that one can use a distributed queue to solve this problem. That is also a viable route, but one that I have less experience with than the solution I described.
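
For what it's worth, that route usually looks something like this: each collection server pushes leads onto a shared queue, and a single consumer drains the queue toward the Processing Server. A minimal sketch of the pattern using Redis lists (purely illustrative, with made-up hostnames; a real setup would use a proper broker such as ActiveMQ, which one of the commenters linked to):

```
import json
import time

import redis
import requests

r = redis.Redis(host="queue.example.com")               # hypothetical shared queue host
PROCESSING_URL = "https://processing.example.com/lead"  # hypothetical endpoint

def enqueue(lead):
    """Called by each collection server once the lead is stored locally."""
    r.rpush("leads", json.dumps(lead))

def drain_forever():
    """Single consumer: pop one lead at a time and retry until it is accepted."""
    while True:
        item = r.blpop("leads", timeout=5)
        if item is None:
            continue  # queue is empty; keep waiting
        lead = json.loads(item[1])
        while True:  # delays are acceptable, losing the lead is not
            try:
                if requests.post(PROCESSING_URL, json=lead, timeout=10).ok:
                    break
            except requests.RequestException:
                pass
            time.sleep(30)  # Processing Server is down; try again later
```

Note that this toy version pops the lead off the queue before it is safely delivered, so a crash of the consumer at the wrong moment could lose it; that is exactly the gap real message brokers close with acknowledgements, and why I would reach for one of those rather than hand-roll it.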

Eric J.
  • What if the load balancer kicks the farm and falls in the bucket (dies, in other words)? Is that a consideration? – Jared Farrish Jun 09 '12 at 01:26
  • One load balancer is a single point of failure. One can setup redundant load balancers as well. The router feeding the load balancer becomes the single point of failure. One can setup redundant routers. The ISP connection feeding the router can fail. One can use multiple ISP's. As you demand more 9's of availability, the cost increases exponentially. The hardware demands on a load balancer are relatively small, so in many environments a failed load balancer can be quickly replaced. In some environments, one really needs to have that 2nd load balancer ready to go. – Eric J. Jun 09 '12 at 01:28
  • Well, that makes sense. Do you see any improvement in reliability by sending duplicate form data streams (via AJAX, for instance) to several different servers? – Jared Farrish Jun 09 '12 at 01:34
  • I am wanting the intake servers to be on separate ISPs in different parts of the country. I just want to be sure I am not missing something super obvious like, "oh, this is a classic distributed queue, have a look here for more info". – zortacon Jun 09 '12 at 01:36
  • Oh, this is a classic distributed queue, [have a look here for more info](http://activemq.apache.org/how-do-distributed-queues-work.html). Ok, that was flip, but I couldn't help myself. – Michael Petrotta Jun 09 '12 at 01:45
  • @JaredFarrish: Sending the data streams to different initial endpoints significantly increases complexity, increasing the chance that the overall solution will fail at some point. Personally I would stick with tried-and-true architectures. – Eric J. Jun 09 '12 at 02:11
  • Oh, no doubt. What I was sketching out is what the OP is "designing", though. Hopefully it doesn't evolve into a fiasco. – Jared Farrish Jun 09 '12 at 02:25
  • @zortacon: The "real" way to use multiple data centers is to use Geo-load balancing. Updating my answer with related info. – Eric J. Jun 09 '12 at 02:39