SolrCloud with Zookeeper - cancel_stream_error & TimeoutException: Idle timeout expired: 120000/120000 ms

Question

I have a SolrCloud setup in Kubernetes with 2 Solr instances, 3 ZooKeeper instances, and 1 shard. Each Solr and ZooKeeper instance is configured with 8G of persistent storage. The memory allocated for Solr is 16G, with a 10G heap. At most 2.5 million records are indexed. A scheduler client calls Solr at /update/json?wt=json&commit=true to perform the add/update/delete operations. Occasionally a huge update/delete of 1 million records happens; it calls the same API (/update/json?wt=json&commit=true) with 500 documents at a time, from multiple threads. Everything worked fine for 1 week, but then errors suddenly appeared in Solr.log, Solr went into an error state, and I had to restart one of the Solr nodes. The errors are:
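For reference, the update pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual scheduler client: the host and collection name are taken from the logs, while the helper names (`update_url`, `batches`, `build_requests`, `post_batch`, `run_bulk_update`), the thread count, and the client-side timeout are assumptions made only for this example.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

SOLR_BASE = "http://solr-0.solrcluster:8983/solr"  # host as it appears in the logs
COLLECTION = "datacore"                            # collection name from the logs
BATCH_SIZE = 500                                   # documents per request, as described


def update_url(base=SOLR_BASE, collection=COLLECTION):
    """Build the update URL; every request carries commit=true."""
    query = urlencode({"wt": "json", "commit": "true"})
    return f"{base}/{collection}/update/json?{query}"


def batches(docs, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def build_requests(docs):
    """Return one (url, json_body) pair per batch of documents."""
    url = update_url()
    return [(url, json.dumps(batch)) for batch in batches(docs)]


def post_batch(url, body):
    """POST one batch to Solr (the 120 s client timeout is an assumption)."""
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return resp.status


def run_bulk_update(docs, send=post_batch, threads=4):
    """Send all batches from a thread pool (thread count is hypothetical)."""
    reqs = build_requests(docs)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda r: send(*r), reqs))
```

With these numbers, a run over 1 million records issues 2,000 requests, each of them carrying `commit=true`.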

Node 1:

021-04-09 08:20:56.657 ERROR (updateExecutor-5-thread-169-processing-x:datacore_shard1_replica_n1 r:core_node3 null n:solr-1.solrcluster:8983_solr c:datacore s:shard1) [c:datacore s:shard1 r:core_node3 x:datacore_shard1_replica_n1] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=S-170262-P-108028200-F-800001737-E-180905508}; node=ForwardNode: http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ to http://solr-0.solrcluster:8983/solr/datacore_shard1_replica_n2/ => java.io.IOException: java.io.IOException: cancel_stream_error
    at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193)
java.io.IOException: java.io.IOException: cancel_stream_error
    at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:193) ~[?:?]

Node 2:

2021-04-09 08:22:56.661 INFO (qtp1632497828-35124) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2] webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{} 0 119999
2021-04-09 08:22:56.661 ERROR (qtp1632497828-35124) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.h.RequestHandlerBase java.io.IOException: java.util.concurrent.TimeoutException: Idle timeout expired: 120000/120000 ms
    at org.eclipse.jetty.server.HttpInput$ErrorState.noContent(HttpInput.java:1085)
    at org.eclipse.jetty.server.HttpInput.read(HttpInput.java:318)

On both nodes we also see the following error:

2021-04-09 08:21:00.812 INFO (qtp1632497828-35036) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.u.p.LogUpdateProcessorFactory [datacore_shard1_replica_n2] webapp=/solr path=/update params={update.distrib=TOLEADER&distrib.from=http://solr-1.solrcluster:8983/solr/datacore_shard1_replica_n1/&wt=javabin&version=2}{} 0 120770
2021-04-09 08:21:00.812 ERROR (qtp1632497828-35036) [c:datacore s:shard1 r:core_node4 x:datacore_shard1_replica_n2] o.a.s.h.RequestHandlerBase java.io.IOException: Task queue processing has stalled for 90013 ms with 0 remaining elements to process.
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient.blockUntilFinished(ConcurrentUpdateHttp2SolrClient.java:501)

The stall time is set to 90000 ms.

Why are we getting these errors, and why does it stall for so long? The average document size is 1 KB. How can we resolve this problem?
