
Using DSE 4.8.7, we are able to insert ~1,000 records/second into a Cassandra table, which is being indexed by Solr. The throughput is fine for a while (maybe 30-60 minutes) until 2-3 nodes (in a 5-node cluster) start showing these messages in the log:

INFO  [datastore.data Index WorkPool work thread-0] 2016-05-17 19:28:26,183  AbstractMetrics.java:114 - Cannot record QUEUE latency of 29 minutes because higher than 10 minutes.

At this point, the insert throughput goes down to 2-10 records/second. Restarting the nodes solves the problem. OS load and IO are both low for all nodes in the cluster. Also, there are no pending tasks when looking at nodetool stats.

This question is almost a verbatim copy of the question here, which is deliberate because (a) this appears to still be an issue, and (b) I'm not able to comment on that question.

  • FYI I'd also like to know where AbstractMetrics.java lives. I don't see it in the solr or cassandra codebase. Is it specific to DSE? – Shion Deysarkar May 17 '16 at 20:23
  • Could be useful http://www.sestevez.com/tuning-dse-search/ – phact May 17 '16 at 21:35
  • Thank you, but we've already gone through that post. We'll revisit it, but I think our current issue is outside that post. – Shion Deysarkar May 18 '16 at 16:25
  • My gut says too few concurrent indexers; you need to strike that balance. Have you looked at the index queue JMX metrics? – phact May 19 '16 at 05:25
  • Were you able to solve this issue, and if so, how? @phact I am still facing this issue, and restarting the nodes does not solve it for me either. I've posted a separate question for the same: http://stackoverflow.com/questions/39493387/cannot-record-queue-latency-of-n-minutes-dse – Hitesh Sep 15 '16 at 04:00
  • I don't think we ever solved this issue. I am not aware of a clear-cut solution. – Shion Deysarkar Sep 16 '16 at 13:56

1 Answer


I thought it worth posting an answer here although I also answered the following question in pretty much the same way:

Cannot record QUEUE latency of n minutes - DSE

When a Solr node is ingesting records, it not only has to ingest them into the normal Cassandra write path, it also has to ingest them into the Solr write path. Cassandra compaction is occurring as well as Solr's equivalent (Lucene segment merging), and both are very expensive in disk I/O.

By default, dse.yaml has the setting max_solr_concurrency_per_core commented out, which can mean too many threads are assigned to indexing across your Solr core(s).
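
For reference, this is set in dse.yaml itself. A minimal sketch, with a value of 2 purely for illustration (tune it against your core count and what you observe in CPU and the index queue):

    # dse.yaml (excerpt) -- illustrative value only; start low and raise it
    # while watching CPU and the IndexPool queue depth.
    max_solr_concurrency_per_core: 2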

The blog post linked above by @phact is indeed a good place to start. Monitor the IndexPool mBean: check the QueueDepth and see if it's increasing. If it is, the node cannot keep up with the indexing throughput, and it's time to look at CPU and I/O. If you aren't seeing high CPU, then you may do well to increase the concurrency.
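
If you'd rather poll that queue depth programmatically than sit in JConsole, a rough sketch over plain JMX follows. The ObjectName filter and the QueueDepth attribute name are assumptions based on the description above, so confirm the exact mBean and attribute names in JConsole (or the docs linked below) before relying on it:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import java.util.Set;

    public class IndexQueueCheck {
        public static void main(String[] args) throws Exception {
            // 7199 is Cassandra's default JMX port; adjust host/port per node.
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Hypothetical domain/name filter: browse the DSE search domain
                // in JConsole to find the exact IndexPool mBean for your core.
                Set<ObjectName> names =
                        mbs.queryNames(new ObjectName("com.datastax.bdp:*"), null);
                for (ObjectName name : names) {
                    if (name.getCanonicalName().contains("IndexPool")) {
                        Object depth = mbs.getAttribute(name, "QueueDepth");
                        System.out.println(name + " QueueDepth=" + depth);
                    }
                }
            }
        }
    }

A steadily climbing value across samples is the signal described above; a queue that drains between samples means the node is keeping up.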

In large clusters, high rates of ingestion are typically handled by a DC of plain Cassandra nodes, which replicates across to the Solr nodes in their own DC. A split workload like this might be worth considering for you as well.
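
If you go that route, the keyspace replication settings are what tie the two DCs together. A minimal sketch, assuming the DCs are named Cassandra and Solr (substitute your real DC names and replication factors):

    -- Hypothetical DC names: ingest lands in the "Cassandra" DC and
    -- replicates to the "Solr" DC, where the search indexing happens.
    CREATE KEYSPACE myks
      WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'Cassandra': 3,
        'Solr': 2
      };

Your ingest clients would then write at a LOCAL_* consistency level against the Cassandra DC, so their write latency isn't tied to how quickly the Solr DC can index.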

Another factor is the size of your index. Reducing the size of things like text fields, for example by setting omitNorms=true in the schema, can vastly decrease the size of the index.
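
As a sketch, in the Solr schema that might look like the following (the field name and type here are placeholders; apply it to fields that don't need length-normalized scoring):

    <!-- Placeholder field: omitNorms="true" drops the per-document length
         normalization data, which shrinks the index for this field. -->
    <field name="body" type="TextField" indexed="true" stored="false" omitNorms="true"/>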

Here are some doc links which might help:

https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchTune.html

https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchCmtQryMbeans.html

– markc