I am learning how to use SolrCloud's new features, and I can successfully set up a ZooKeeper ensemble and a set of Solr instances for a sharded index. I wanted to investigate how failures affect my setup. Mostly it worked as expected, except for one case.
I used two machines and started 3 ZooKeepers on each (6 total). I started a Solr instance on one machine (bosmac01), asking for 2 shards, and then started a second instance on that machine. I then started two more Solr instances on a second machine (qasolrmaster). The Solr admin showed the configuration I expected, and indexing and querying worked:
Shard1: qasolrmaster:8900 and bosmac01:8983
Shard2: qasolrmaster:8910 and bosmac01:8920
I wanted to test what would happen if one machine crashed, so I shut down qasolrmaster. Since 3 ZooKeepers would still be running, and each shard would still have a Solr instance connected to it, I expected everything to keep working. Instead, the two remaining Solr instances (on bosmac01) kept trying to reconnect to the missing ZooKeepers. The Admin would not display the cloud image, and I could not add docs or query. The same thing happens if I just stop all the ZooKeepers on qasolrmaster but leave the machine running. Restarting one of the missing ZooKeepers returned things to normal.
Why did the test fail? 3 ZooKeepers plus a Solr instance for each shard should allow things to keep working, yes?
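
For reference, here is a minimal sketch of the arithmetic behind the question. It assumes ZooKeeper's default majority-quorum rule, under which an ensemble stays available only while more than half of its configured servers are up. The function names `quorum_size` and `has_quorum` are hypothetical helpers, not part of any ZooKeeper or Solr API; the machine names and counts come from the setup described above.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority (assumed rule)."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, alive: int) -> bool:
    """True if the live servers are a strict majority of the configured ensemble."""
    return alive >= quorum_size(ensemble_size)


# ZooKeeper servers per machine, as described in the setup above.
ensemble = {"bosmac01": 3, "qasolrmaster": 3}
total = sum(ensemble.values())

# Shutting down qasolrmaster removes its 3 ZooKeepers.
surviving = total - ensemble["qasolrmaster"]

print("majority needed:", quorum_size(total))
print("quorum with qasolrmaster down:", has_quorum(total, surviving))
print("quorum after restarting one ZooKeeper:", has_quorum(total, surviving + 1))
```

Under that assumption, 3 live servers out of 6 fall one short of a majority, while 4 out of 6 meet it, which would be consistent with things returning to normal once a single ZooKeeper was restarted.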