
One of the nodes in our 3-node cluster is down, and on checking the log file it shows the messages below:

INFO  [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:32,891  AbstractMetrics.java:114 - Cannot record QUEUE latency of 11 minutes because higher than 10 minutes.
INFO  [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:33,233  AbstractMetrics.java:114 - Cannot record QUEUE latency of 10 minutes because higher than 10 minutes.
WARN  [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:33,398  Worker.java:99 - Interrupt/timeout detected.
java.util.concurrent.BrokenBarrierException: null
at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:200) ~[na:1.7.0_79]
at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:355) ~[na:1.7.0_79]
at com.datastax.bdp.concurrent.FlushTask.bulkSync(FlushTask.java:76) ~[dse-core-4.8.3.jar:4.8.3]
at com.datastax.bdp.concurrent.Worker.run(Worker.java:94) ~[dse-core-4.8.3.jar:4.8.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
WARN  [keyspace.core Index WorkPool work thread-2] 2016-09-14 14:05:33,398  Worker.java:99 - Interrupt/timeout detected.
java.util.concurrent.BrokenBarrierException: null
at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:200) ~[na:1.7.0_79]
at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:355) ~[na:1.7.0_79]
at com.datastax.bdp.concurrent.FlushTask.bulkSync(FlushTask.java:76) ~[dse-core-4.8.3.jar:4.8.3]
at com.datastax.bdp.concurrent.Worker.run(Worker.java:94) ~[dse-core-4.8.3.jar:4.8.3]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_79]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
INFO  [keyspace.core Index WorkPool work thread-4] 2016-09-14 14:05:33,720  AbstractMetrics.java:114 - Cannot record QUEUE latency of 13 minutes because higher than 10 minutes.
INFO  [keyspace.core Index WorkPool work thread-4] 2016-09-14 14:05:33,721  AbstractMetrics.java:114 - Cannot record QUEUE latency of 13 minutes because higher than 10 minutes.

Each node's configuration is 8 CPUs, 32 GB RAM, and 500 GB of disk space. What could be the reasons for only one particular node going down?

Hitesh

1 Answer


So I'm going to answer with some general info here; your case might be more complex. 32 GB RAM might not be large enough for a Solr node, and using the G1 collector on Java 1.8 has proved better for Solr with heap sizes above 26 GB.
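
If you do move to Java 1.8 and G1, a minimal sketch of the relevant settings is below, assuming a DSE 4.8 package install where the JVM options live in cassandra-env.sh; the 26G heap and the 500 ms pause target are assumed starting points to test against your own workload, not prescriptions:

```
# Sketch only: enable G1 on Java 1.8 with a larger heap (cassandra-env.sh).
# Remove or comment out the CMS-specific options before adding these.
MAX_HEAP_SIZE="26G"                               # assumed starting point, tune for your workload
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"                 # switch from CMS to G1
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"     # assumed pause target, adjust as needed
```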

I'm also not sure what heap size, JVM settings and number of Solr cores you have here. However, I've seen similar errors when a node is busy indexing and trying to keep up. One of the most common problems I've seen on Solr nodes is `max_solr_concurrency_per_core` being left at its default (commented out) in dse.yaml. That typically allocates one indexing thread per CPU core, and to further compound the problem, the OS might report 8 cores, but with hyper-threading that is likely only 4 physical cores.

Check your dse.yaml and set it to `num physical cpu cores / num of solr cores`, with 2 as a minimum. Indexing might be slower, but it should take the pressure off your node.
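
As a hedged illustration only (assuming a box that reports 8 vCPUs with hyper-threading, i.e. 4 physical cores, and 2 Solr cores; numbers chosen purely for the example), the dse.yaml entry would look something like:

```
# Illustration: 4 physical cores / 2 Solr cores = 2 indexing threads per Solr core,
# which is also the documented minimum. Uncomment and set it explicitly instead of
# leaving the default (one thread per CPU core the JVM sees).
max_solr_concurrency_per_core: 2
```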

I'd recommend this blog as a good starting point for tuning DSE Search:

http://www.datastax.com/dev/blog/tuning-dse-search

Also docs on the subject:

https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchTune.html

markc
  • Thanks for the answer; it did not help completely, but it did help me improve performance. I started using the G1 collector by upgrading from Java 1.7 to 1.8, and this has helped a lot. To answer your queries: my heap size is 14GB, the JVM settings are default, and I have about 150 solr cores. And yes, my `max_solr_concurrency_per_core` is left at default. (P.S. I did not downvote) – Hitesh Oct 27 '16 at 05:10
  • @Hitesh glad you found it useful. I had to give a generic answer, as without seeing the complete log and all your config it's hard to give a more detailed one. You should definitely tune down the concurrency, as it will default to the number of CPU cores, which could overload your CPU. You probably also want to increase your heap size; I've seen guys in the field mention around 26GB as a good starting point – markc Oct 31 '16 at 10:26
  • thanks, I will consider your suggestion on the heap size. Regarding the concurrency, what do you suggest it should be based upon my configuration? – Hitesh Nov 14 '16 at 12:32
  • @Hitesh as I mentioned above, start with the documented suggestion of `num physical cpu cores / num of solr cores`, with 2 at a minimum: https://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/srch/srchThrds.html – markc Nov 15 '16 at 10:41
  • I have about 200 solr cores, so I am unable to understand how that calculation fits for me. Each of my nodes has 8 CPUs. Shall I set `max_solr_concurrency_per_core` to 6, so that 2 CPUs can be used for other operations? – Hitesh Nov 15 '16 at 11:01
  • @Hitesh It depends on how many are indexing at any one time. 200 is a rather high amount of cores, are you writing to all of them all the time? – markc Nov 15 '16 at 12:15
  • no, most of the writes are happening all the time on around 20 cores – Hitesh Nov 15 '16 at 12:19
  • Even with 20 cores indexing, it's still probably too much for those 8-core machines to handle, and they are probably 4 physical CPU cores as I pointed out. You will probably need to scale up to be able to index 20 cores all at the same time. – markc Nov 15 '16 at 12:26
  • So you are suggesting that each of my nodes should be at least a 20-core machine, right? That might help my situation. – Hitesh Nov 16 '16 at 07:29
  • So ideally you should reserve 2 _physical_ cores per solr core. This is a guide: if you have, say, 4 solr cores and your concurrency per core is 2, then ideally 8 physical cores. It does depend on whether you're writing to all the tables that back the solr cores at the same time. This is why you need to test, but yes, it does sound like bigger CPUs would also help here – markc Nov 16 '16 at 10:59