How does SolrCloud handle host failures?

Question

I am learning how to use SolrCloud's new features. I can successfully set up an ensemble of Zookeepers and a set of Solr instances for a sharded index. I wanted to investigate how failures affect my setup. Mostly it worked as expected, except for one case.

I used two machines and started 3 Zookeepers on each (6 total). I started a Solr instance on one machine (bosmac01), asking for 2 shards, and started a second instance on that machine. I then started two more Solr instances on a second machine (qasolrmaster). The Solr admin showed the configuration I expected, and indexing/querying worked:

Shard1: qasolrmaster:8900 and bosmac01:8983
Shard2: qasolrmaster:8910 and bosmac01:8920

I wanted to test what would happen if one machine crashed, so I shut down qasolrmaster. I expected that, since 3 Zookeepers would still be running and a Solr instance would still be connected to each shard, everything would still work. Instead, the two remaining Solr instances (on bosmac01) kept trying to reconnect to the missing Zookeepers. The Admin would not display the cloud image, and I could not add docs or query. The same thing happens if I just stop all the Zookeepers on qasolrmaster but leave the machine running. Restarting one of the missing Zookeepers returned things to normal.

Why did the test fail? 3 Zookeepers plus a Solr for each shard should allow things to keep working, yes?

score 2 · Answer 1 · answered Nov 16 '12 at 13:10


ZK requires a majority of its nodes to stay up. If you put 3 on one machine and 3 on another, then kill 3, you do not have a majority: 3 of 6 is only half.
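The majority rule can be checked with a little arithmetic. The following is a minimal sketch, assuming ZooKeeper's default majority quorum (strictly more than half of the configured ensemble must be up). The function names `quorum_size` and `has_quorum` are hypothetical helpers for illustration, not part of any ZooKeeper or Solr API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that is strictly more than half the ensemble."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, servers_up: int) -> bool:
    """True if the servers still running form a majority of the configured ensemble."""
    return servers_up >= quorum_size(ensemble_size)


# Scenario from the question: 6 servers split 3/3 across two machines, one machine lost.
print(has_quorum(6, 3))  # False: a majority of 6 is 4, and only 3 remain

# 5 servers split 3/2 across two machines, the machine holding 2 is lost.
print(has_quorum(5, 3))  # True: a majority of 5 is 3
```

Note that the calculation uses the configured ensemble size, not the number of servers currently alive. With only two machines, losing the machine that holds the larger share always breaks the quorum, whatever the split.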


Mark Miller


OK. I guess I misunderstood the documentation. I thought it was the number of ZK instances that was important, not the number of nodes. Thanks for clarifying. – user1827533 Nov 16 '12 at 13:58
In the end, the number of ZK instances is what matters. But you are removing half of them by killing one node. Try putting 3 on one, 2 on another, then kill the node with 2. The other 3 will be a majority and continue functioning as an ensemble. – Mark Miller Nov 16 '12 at 14:15
You'll need at least 3 machines to make this work in a real world scenario. – sourcedelica Nov 16 '12 at 15:05
Still confused -- I created 4 VMs, one ZK on each. I stopped one of the ZKs and the others panicked. Restarting the ZK fixed it. Shouldn't the other 3 have kept working? I can try the 3/2 division you suggested, but shouldn't this have worked? Sorry if I'm being dense :( – user1827533 Nov 16 '12 at 15:22
I am facing similar issues. Even if one solr process goes down, the cluster is unusable: http://lucene.472066.n3.nabble.com/solrcloud-4-3-1-stability-and-failure-scenario-questions-td4072392.html – zengr Jun 22 '13 at 11:19
