
I have an 8-node SolrCloud cluster connected to an external ZooKeeper ensemble. Each node has 30 GB of RAM and 4 cores. I have created around 100 collections, each with approximately 30 shards. (Why I need this is a different story: business isolation, business requirements, it could be anything.)

Now I am ingesting data into the cluster across 30 collections simultaneously, and I see that ingestion into a few of the collections is failing. In the Solr logs I can see the "Connection reset" exception below. The overall ingestion time is on the order of 10 hours.

Any suggestions? Even if this is due to resource starvation, how can I prove that the connection resets are caused by a lack of resources?
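One way to show (or rule out) resource starvation is to record CPU load and cumulative GC time on each Solr node while the ingestion runs, and line those timestamps up with the "Connection reset" entries in the log. The sketch below polls one node over remote JMX; the host, the port, and the assumption that the Solr JVMs were started with remote JMX enabled (e.g. -Dcom.sun.management.jmxremote.port=18983) are placeholders, and watching jstat or the GC log on each node would give the same information.

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/**
 * Polls a Solr node's load average and total GC time over remote JMX once per
 * second, so that "Connection reset" timestamps in the Solr log can be lined
 * up against CPU/GC pressure on that node.
 */
public class SolrResourceProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical host and JMX port; adjust to one of your Solr nodes.
        String url = "service:jmx:rmi:///jndi/rmi://solr-node-1:18983/jmxrmi";
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url))) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            OperatingSystemMXBean os = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.OPERATING_SYSTEM_MXBEAN_NAME, OperatingSystemMXBean.class);
            // One proxy per garbage collector registered on the remote JVM.
            List<GarbageCollectorMXBean> gcs = new ArrayList<GarbageCollectorMXBean>();
            for (ObjectName name : conn.queryNames(
                    new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                gcs.add(ManagementFactory.newPlatformMXBeanProxy(
                        conn, name.getCanonicalName(), GarbageCollectorMXBean.class));
            }
            while (true) {
                long gcMillis = 0;
                for (GarbageCollectorMXBean gc : gcs) {
                    gcMillis += gc.getCollectionTime(); // total time this collector has spent in GC
                }
                // getSystemLoadAverage() is the remote node's 1-minute load average (-1 if unavailable)
                System.out.printf("%tT load=%.2f gcTimeMs=%d%n",
                        System.currentTimeMillis(), os.getSystemLoadAverage(), gcMillis);
                Thread.sleep(1000L);
            }
        }
    }
}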

==== Exception ======

2015-01-30 09:16:14,454 ERROR [updateExecutor-1-thread-8151] ? (:) - error
java.net.SocketException: Connection reset

at java.net.SocketInputStream.read(SocketInputStream.java:196) ~[?:1.7.0_55]
at java.net.SocketInputStream.read(SocketInputStream.java:122) ~[?:1.7.0_55]
at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:160) ~[httpcore-4.3.jar:4.3]
at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:84) ~[httpcore-4.3.jar:4.3]
at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:273) ~[httpcore-4.3.jar:4.3]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:140) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260) ~[httpcore-4.3.jar:4.3]
at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:283) ~[httpcore-4.3.jar:4.3]
at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:251) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:197) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271) ~[httpcore-4.3.jar:4.3]
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.3.jar:4.3]
at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:682) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:486) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) ~[httpclient-4.3.1.jar:4.3.1]
at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:233) [solr-solrj-4.10.0.jar:4.10.0 1620776 - rjernst - 2014-08-26 20:49:51]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [?:1.7.0_55]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [?:1.7.0_55]
at java.lang.Thread.run(Thread.java:745) [?:1.7.0_55]
nxG

1 Answer


Based on my experience, this happens when the CPU is maxed out; even if it is only maxed out for a minute, it can cause update failures.

This is because the node cannot perform any other operations except the one it is currently trying to finish. The update sits in its queue, but in the meantime ZooKeeper marks the node down because it cannot communicate with the shard. Since the shard is marked down (this is what causes the connection reset), the node that sent the update gets an update-shard error.

If you have a replica, the replica shard becomes the leader and the current node becomes a replica. This triggers the index being pulled down in full from the new leader shard.
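If that is what is happening, replica states should flip from "active" to "down" or "recovering" at the same moments the connection resets appear. Since Solr 4.8 the Collections API exposes this via CLUSTERSTATUS; the sketch below simply dumps that response so it can be checked while the ingestion runs (host and port are placeholder assumptions).

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/**
 * Dumps the CLUSTERSTATUS response so replica states ("active", "recovering",
 * "down") can be checked while ingestion is running.
 */
public class ClusterStatusDump {
    public static void main(String[] args) throws Exception {
        // Any live node can answer; host and port below are placeholders.
        URL url = new URL("http://solr-node-1:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json");
        HttpURLConnection http = (HttpURLConnection) url.openConnection();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(http.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // grep this output for "recovering" or "down"
            }
        } finally {
            http.disconnect();
        }
    }
}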

You could prevent this by slowing down querying while indexing, and by throttling the indexing rate itself. Adding more CPUs or more RAM would also help.
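On the client side, "slowing down indexing" usually means smaller batches, fewer concurrent update threads, and a pause between batches. A minimal SolrJ 4.10 sketch along those lines; the URL, collection name, batch size, queue size, thread count, and pause are all placeholder values to tune per cluster.

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

/**
 * Indexes in small batches with a short pause between them, using a small
 * queue and few sender threads so the cluster is not flooded with updates.
 */
public class ThrottledIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; queue size 100 and 2 sender threads keep concurrency low.
        ConcurrentUpdateSolrServer server =
                new ConcurrentUpdateSolrServer("http://solr-node-1:8983/solr/collection1", 100, 2);
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);
            batch.add(doc);
            if (batch.size() == 500) {      // small batches instead of one huge stream
                server.add(batch);
                batch.clear();
                Thread.sleep(200L);         // crude throttle between batches
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.blockUntilFinished();        // wait for queued updates to be sent
        server.commit();
        server.shutdown();
    }
}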

Mohan
  • I agree with the CPU concerns. I support Alfresco Solr, and I have observed that another cause of problems is garbage collection. It is useful to tune GC to reduce long pauses. G1GC helps, but in my experience it requires significantly more memory. Another possibility is reducing the number of active threads. – luiscolorado Jan 15 '18 at 21:43
  • We already have these settings, but they did not help: -XX:+AggressiveOpts -XX:+ParallelRefProcEnabled -XX:+CMSParallelRemarkEnabled -XX:CMSMaxAbortablePrecleanTime=6000 -XX:CMSTriggerPermRatio=80 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSFullGCsBeforeCompaction=1 -XX:PretenureSizeThreshold=64m -XX:+CMSScavengeBeforeRemark -XX:+UseConcMarkSweepGC -XX:MaxTenuringThreshold=8 -XX:TargetSurvivorRatio=90 -XX:SurvivorRatio=4 -XX:NewRatio=3 @luiscolorado – Sthita Jul 04 '19 at 03:24