0

Occasionally, there are below exception in production environment,

2020-01-29 17:10:46.085 ERROR 2852 --- [o-8022-exec-258] c.c.p.common.dao.SearchDao               : Search person by id failed

java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-832 [ACTIVE]
        at org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:789) ~[elasticsearch-rest-client-7.1.1.jar!/:7.1.1]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:225) ~[elasticsearch-rest-client-7.1.1.jar!/:7.1.1]
        at org.elasticsearch.client.RestClient.performRequest(RestClient.java:212) ~[elasticsearch-rest-client-7.1.1.jar!/:7.1.1]
        at org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1433) ~[elasticsearch-rest-high-level-client-7.1.1.jar!/:7.1.1]
        at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1403) ~[elasticsearch-rest-high-level-client-7.1.1.jar!/:7.1.1]
        at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1373) ~[elasticsearch-rest-high-level-client-7.1.1.jar!/:7.1.1]
        at org.elasticsearch.client.RestHighLevelClient.get(RestHighLevelClient.java:699) ~[elasticsearch-rest-high-level-client-7.1.1.jar!/:7.1.1]

Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-832 [ACTIVE]
        at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387) ~[httpcore-nio-4.4.11.jar!/:4.4.11]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
        at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39) ~[httpasyncclient-4.1.4.jar!/:4.1.4]
        at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175) ~[httpcore-nio-4.4.11.jar!/:4.4.11]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:263) ~[httpcore-nio-4.4.11.jar!/:4.4.11]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:492) ~[httpcore-nio-4.4.11.jar!/:4.4.11]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:213) ~[httpcore-nio-4.4.11.jar!/:4.4.11]
        at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280) ~[httpcore-nio-4.4.11.jar!/:4.4.11]
        at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104) ~[httpcore-nio-4.4.11.jar!/:4.4.11]
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591) ~[httpcore-nio-4.4.11.jar!/:4.4.11]

but it's just a simple query, not a complicated query

 curl 'http://localhost:9201/person/_doc/30154410564?pretty'

and in this time, the load is very low enter image description here

So why could exist these timeout exceptions? and there are many search query but why only this simple query by id could cause this exception?

The person index is synchronized from Oracle DB, and there is a scheduled task, every 10 min it will sync changed person to person index, and if access person index in this time, it will cause 30,000 milliseconds timeout. So how to resolve it? And it seems access by Java client, it will exist this phenomenon, but access by curl in command line it dose not exist this phenomenon.

PS:

health status index               uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   person              jb3msRw5S9ixgXN5SLd6bw   1   0  140754205     19239587     19.8gb         19.8gb

and in this time there was index write for the person index enter image description here

RestClient Config:

private final RestHighLevelClient restHighLevelClient;
restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost(host, port)));
zhuguowei
  • 8,401
  • 16
  • 70
  • 106
  • could you provide the slow queries log during these timeouts and also share ur search query and index stats(how many shards, replicas, no of doc etc) , also check the https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html during these timeouts and provide information from it. This kinda issues requires lot of data to do RCA and It is quite possible that timeout of this simple query is some other costly operation running during that time(which eats up resources) hence it happens occasionally – Amit Jan 30 '20 at 03:27
  • Something tells me it has nothing to do with your cluster which seems pretty quiet, but somehow your client can't reach your cluster and hence the query times out. Are you sure that the cluster is running on port 9201 ? – Val Jan 30 '20 at 04:26
  • @Val Yeah, It's running on port 9201. – zhuguowei Jan 30 '20 at 04:28
  • So you can curl any other query and it works, just not that one? – Val Jan 30 '20 at 04:29
  • Then are you sure that your High level REST client is properly configured? It really looks like the client can't reach the cluster and the connection times out. – Val Jan 30 '20 at 04:32
  • @Val No, It's not always timeout, It's just occasionally timeout, and I want to know which situation could cause the phenomenon. And it is production environment and run for quite a long time – zhuguowei Jan 30 '20 at 04:33
  • If you have many queries running at the same time and those queries are heavy and take a substantial amount of time, you might exhaust the number of threads that are available to handle new queries. But looking at the health metrics above, it doesn't like like your cluster is under heavy load. It would probably help if you provide more information, like your cluster configuration and the client configuration. – Val Jan 30 '20 at 04:36
  • @OpsterESNinja I have added more info, because the slow log default is disabled, so I cannot provide slow log, and app's timeout exception log exists delay(10 min) So cannot get hot threads timely. And now I guess if in this time there were index write and then refresh the index cause the phenomenon – zhuguowei Jan 30 '20 at 04:57
  • @Val I found if query person index when there are update applied to person index, it will cause timeout, So any manner could help? – zhuguowei Feb 01 '20 at 03:24

1 Answers1

1

By calling hot_threads api

 curl 'http://localhost:9201/_nodes/hot_threads?pretty'

got below info:

   100.9% (504.4ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][get][T#6]'
     8/10 snapshots sharing following 33 elements
       app//org.apache.lucene.index.SingletonSortedNumericDocValues.nextDoc(SingletonSortedNumericDocValues.java:53)
       app//org.apache.lucene.codecs.lucene80.IndexedDISI.writeBitSet(IndexedDISI.java:196)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer.writeValues(Lucene80DocValuesConsumer.java:214)
       app//org.apache.lucene.codecs.lucene80.Lucene80DocValuesConsumer.addNumericField(Lucene80DocValuesConsumer.java:111)
       app//org.apache.lucene.codecs.perfield.PerFieldDocValuesFormat$FieldsWriter.addNumericField(PerFieldDocValuesFormat.java:109)
       app//org.apache.lucene.index.ReadersAndUpdates.handleDVUpdates(ReadersAndUpdates.java:368)
       app//org.apache.lucene.index.ReadersAndUpdates.writeFieldUpdates(ReadersAndUpdates.java:570)
       app//org.apache.lucene.index.ReaderPool.writeAllDocValuesUpdates(ReaderPool.java:228)
       app//org.apache.lucene.index.IndexWriter.writeReaderPool(IndexWriter.java:3308)
       app//org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:520)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:294)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:269)
       app//org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:259)
       app//org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
       app//org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140)
       app//org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156)
       app//org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
       app//org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
       app//org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
       app//org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1548)
       app//org.elasticsearch.index.engine.InternalEngine.get(InternalEngine.java:652)
       app//org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:916)
       app//org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:169)
       app//org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:93)
       app//org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:84)
       app//org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:106)
       app//org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:45)

it seems when query person by id, the internal auto executes refresh, and from official document got

By default, the get API is realtime, and is not affected by the refresh rate of the index (when data will become visible for search). If a document has been updated but is not yet refreshed, the get API will issue a refresh call in-place to make the document visible. This will also make other documents changed since the last refresh visible. In order to disable realtime GET, one can set the realtime parameter to false.

Note: Every time access person detail page, it will update this person's viewCount.

So I disabled realtime explicitly

        GetRequest getRequest = new GetRequest(personIndex, id.toString());
        getRequest.realtime(false);

When did so, timeout problem is solved.

zhuguowei
  • 8,401
  • 16
  • 70
  • 106