
I would like to set up an ultra-fast SolrCloud system, ideally with guaranteed low response times. The issue is that Solr typically has around 1-5% slow responses, e.g. due to leader elections, queries on frequent terms, more segment merging, etc.

Question: Has anybody ever implemented such a solution, or can you point me to similar solutions or to issues/caveats to consider?

I’ve been analyzing the SolrJ client and think that an approach similar to that of the LBHttpSolrClient could work – with these modifications:

  1. The client would send queries to all relevant SolrCloud nodes in parallel (multi-threaded) and use the first answer that arrives. The parallel requests could be issued with a web service framework such as Apache CXF (see the first sketch after this list).

  2. Control over the document ids, control/tracking of their distribution across shards/replicas, and monitoring through ZooKeeper / cluster status (e.g. as returned with queries). Then, based on the cluster setup configuration and the current status (including ZooKeeper queries), the SolrJ client could send queries to exactly those nodes that should be alive and relevant for a given query (see the routing sketch after this list).

  3. Notifying SolrJ: It would be great if SolrJ could be notified of cluster changes or of services (ZooKeeper / Solr / Ranger, etc.) that are temporarily unavailable, so that no time is lost waiting for them (see the ZooKeeper watch sketch at the end of the question).

  4. Adding monitoring/alerting: Ideally, the SolrJ client would record the timings of all answers and report them per node and for ZooKeeper to a monitoring component (Ambari, Atlas, a log, a monitoring/alerting database, e-mail, etc.).
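
To make point 1 more concrete, here is a minimal sketch of the fan-out idea, using plain SolrJ and a thread pool instead of CXF: the same query goes to every candidate replica, the first response wins, and the slower requests are cancelled. The class name and the `reportTiming` helper are only placeholders, and the SolrJ calls assume a 7.x client. It also hints at the timing hook from point 4, since every node's response time is available at this spot.

```java
import java.util.List;
import java.util.concurrent.*;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FirstResponseQuery {

    /**
     * Sends the same query to every candidate replica in parallel and
     * returns the first response that arrives; the remaining requests
     * are cancelled when the pool is shut down.
     */
    public static QueryResponse queryFirstResponder(List<String> replicaUrls,
                                                    SolrQuery query)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(replicaUrls.size());
        CompletionService<QueryResponse> cs = new ExecutorCompletionService<>(pool);
        try {
            for (String url : replicaUrls) {
                cs.submit(() -> {
                    long start = System.nanoTime();
                    // a real client would reuse pooled SolrClient instances
                    try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                        QueryResponse rsp = client.query(query);
                        // point 4: push the per-node timing to a monitoring sink
                        reportTiming(url, (System.nanoTime() - start) / 1_000_000);
                        return rsp;
                    }
                });
            }
            // take() blocks only until the fastest replica has answered
            return cs.take().get();
        } finally {
            pool.shutdownNow(); // give up on the slower requests
        }
    }

    // placeholder: report to Ambari, a log, a metrics database, e-mail, ...
    private static void reportTiming(String nodeUrl, long millis) {
        System.out.printf("%s answered in %d ms%n", nodeUrl, millis);
    }
}
```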

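For point 2, I would build on Solr's compositeId routing, roughly as below. The collection name, ZooKeeper address and field names are placeholders, and the CloudSolrClient builder signature differs between SolrJ versions (this form is from 7.x). Everything before the "!" in the document id decides the shard, and the _route_ parameter restricts a query to that shard.

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CustomerRouting {

    public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zkhost:2181"), Optional.empty()).build()) {

            client.setDefaultCollection("mycollection");

            // compositeId routing: the prefix before '!' determines the shard,
            // so all documents of customer 42 land on the same shard
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "customer42!order-4711");
            doc.addField("customer_id_s", "customer42");
            client.add(doc);
            client.commit();

            // at query time, _route_ limits the request to that customer's shard
            SolrQuery query = new SolrQuery("*:*");
            query.set("_route_", "customer42!");
            System.out.println(client.query(query).getResults().getNumFound());
        }
    }
}
```
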
Any suggestions?
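
Regarding point 3, CloudSolrClient already tracks the cluster state through ZooKeeper, but as a sketch of the explicit notification I have in mind, a raw ZooKeeper watch on the collection's state.json could look roughly like this (connect string, path and timeout are placeholders; ZooKeeper watches are one-shot and must be re-registered):

```java
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ClusterStateWatch {

    public static void main(String[] args) throws Exception {
        // SolrCloud keeps the per-collection state under /collections/<name>/state.json
        String path = "/collections/mycollection/state.json";
        ZooKeeper zk = new ZooKeeper("zkhost:2181", 15000, event -> { });

        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
                    // cluster state changed: refresh the node/replica map the client
                    // uses and mark unreachable replicas as "do not query"
                    System.out.println("state.json changed: " + event.getPath());
                    try {
                        zk.getData(path, this, null); // re-register the one-shot watch
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }
        };

        zk.getData(path, watcher, null); // set the initial watch
        Thread.sleep(Long.MAX_VALUE);    // keep the process alive in this sketch
    }
}
```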

  • Be aware that querying all nodes will make it impossible to scale up the performance of your cluster, since you can't share the query load across multiple servers. The second part: how about using Solr's built-in document routing? You can create explicit Zookeeper watches to know when the cluster state changes for nodes. The third - if Zookeeper is gone, handle the exception and log / notify whatever shared location you'll use to coordinate between SolrJ nodes. For the fourth: that should be possible with any of the monitoring solutions. – MatsLindh Sep 05 '18 at 19:38
  • Hi MatsLindh, I’m thinking about a different solution that does scale up: in my solution, SolrJ would know the exact nodes of the primary shard and of the replicas for each document, based on parts of the document id (e.g. on the customer id). Thus it would contact exactly as many Solr nodes as the replication factor, minus any relevant nodes known to be unavailable. Regarding 2), yes, I plan to use the built-in document routing for this. But the SolrJ client would need to manage that knowledge and use it for queries, and that functionality still seems to be missing. Regarding 3) and 4), I would like to be faster. – Thomas_Poetter Sep 05 '18 at 20:56
  • ... faster than with the default timeouts (2 s), and I also don’t want to put too much additional load on the system by setting the timeout to a much smaller value. 3) might be the simplest change that would already eliminate many of the timeout waits. Maybe a solution for this already exists, or it could be done most easily? 4) should have synergy when implemented together with the other solutions, be much faster, and put less load on the system than standard monitoring. – Thomas_Poetter Sep 05 '18 at 20:57
  • My point is that as long as you query all available nodes for a given shard, you won't be able to scale further than the throughput of your single, fastest node. The rest of the nodes will be swamped with traffic that they can't answer and give a proper response for. My guess is that doing manual replication and index to specific customer cores could be easier to implement and keep full control over - you can then add custom replication rules (CDCR might also help here) for how you want to distribute cores, as well as keep full track of that information. – MatsLindh Sep 05 '18 at 22:12
  • Hi MatsLindh, (manual/controlled) "replication and index to specific customer cores" - that's exactly what I plan to do. Of course the client would be configured to know the right nodes/replicas, or a mathematical function would calculate/track them. No hardcoding! Do you or does anybody else know of such a solution? – Thomas_Poetter Sep 06 '18 at 07:02

0 Answers