
I have a 4-node Elasticsearch cluster and a .NET console application that is designed to fill the cluster with data that comes from SQL. Everything works fine as long as I keep the rate of records being added (or deleted) fairly low. If I increase the number of threads, I eventually see timeout errors from my console app. The cluster has a total of 48 cores, and the average time it takes to index a record is about 0.1 seconds.

I have been able to get it to do about 7,000 records (documents) per second. I never see any exceptions thrown from Elasticsearch.Net that indicate low resources, I never see any of the indexing queues overloaded, and the servers never peak at more than about 10% CPU. It looks like the issue is not the cluster or its configuration but something in the NEST connection. Here is my code for the connection:

//set up the es client
Uri node = new Uri(ConfigurationManager.AppSettings["ESConnectionString"]);
var connectionPool = new SniffingConnectionPool(new[] { node });
ConnectionSettings settings = new ConnectionSettings(connectionPool);
settings.SetDefaultPropertyNameInferrer(p => p); //ditch the camelcase
settings.SniffOnConnectionFault(true);
settings.SniffOnStartup(true);
settings.SniffLifeSpan(TimeSpan.FromMinutes(1));
settings.SetPingTimeout(3000);
settings.SetTimeout(5000);
settings.MaximumRetries(5);
//settings.SetMaximumAsyncConnections(20);
settings.SetDefaultIndex("dummyindex");
settings.SetBasicAuthentication(ConfigurationManager.AppSettings["ESUser"], ConfigurationManager.AppSettings["ESPass"]);
ElasticClient client = new ElasticClient(settings);

I have the cluster set up with http.basic authentication, but I have tried with it turned on and off and there is no difference. Here are some of the pertinent settings from the ES nodes:

discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 30s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["CACHE01","CACHE02","CACHE03","CACHE04"]
cluster.routing.allocation.node_concurrent_recoveries: 5
indices.recovery.max_bytes_per_sec: 50mb
http.basic.enabled: true
http.basic.user: "admin"
http.basic.password: "XXXXXXX"

At this point I can't figure out whether the issue is the .NET client or the servers. Everything points to the client, but I'm at a loss for what to try next. I don't think I can use the Bulk API, because I'm essentially just replicating changes from a SQL server, and in order to keep the two in sync I execute each change as soon as it's received. It also seems that I can insert new documents at a much faster pace than I can update existing ones. I have read the update docs, and they almost read as if partial updates are better than full updates, but there is still the whole get-update-delete-reindex cycle that seems to happen with every update.
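For reference, a single partial update with the NEST 1.x syntax used above would look roughly like the sketch below; `changedId`, `newPrice`, and the `Price` field are placeholders for whatever actually changes in SQL, and the `esv` type is the document class from my repository:

// Rough sketch of a partial update (placeholder field names).
// Only the changed fields go over the wire; Elasticsearch still performs the
// internal get + reindex, but the request payload stays small.
var updateResponse = client.Update<esv, object>(u => u
    .Index("dummyindex")
    .Id(changedId)                    // _id of the document to patch
    .Doc(new { Price = newPrice }));  // only the fields that changed

if (!updateResponse.IsValid)
{
    // inspect the response / log the failure for diagnostics
}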

According to the ES docs I'm not supposed to tweak the thread pools or the other performance settings. I don't think I'm hitting any of those limits anyway, and the ES error logs don't indicate any issues either.

Anyone have advice on what I can do to track down the connection errors?

UPDATE: This is the actual error:

Error: Unexpected result (SaveToES).
Elasticsearch.Net.Exceptions.MaxRetryException: Sniffing known nodes in the cluster caused a maxretry exception of its own
 ---> Elasticsearch.Net.Exceptions.SniffException: Sniffing known nodes in the cluster caused a maxretry exception of its own
 ---> Elasticsearch.Net.Exceptions.MaxRetryException: Retry timeout 00:00:05 was hit after retrying 1 times: 'GET _nodes/_all/clear?timeout=3000'.
InnerException: WebException, InnerMessage: The operation has timed out, InnerStackTrace:
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.Connection.HttpConnection.DoSynchronousRequest(HttpWebRequest request, Byte[] data, IRequestConfiguration requestSpecificConfig)
InnerException: WebException, InnerMessage: The operation has timed out, InnerStackTrace:
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.Connection.HttpConnection.DoSynchronousRequest(HttpWebRequest request, Byte[] data, IRequestConfiguration requestSpecificConfig)
 ---> System.AggregateException: One or more errors occurred.
 ---> System.Net.WebException: The operation has timed out
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.Connection.HttpConnection.DoSynchronousRequest(HttpWebRequest request, Byte[] data, IRequestConfiguration requestSpecificConfig)
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandlerBase.ThrowMaxRetryExceptionWhenNeeded[T](TransportRequestState`1 requestState, Int32 maxRetries)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.RetryRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.DoRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.RetryRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.DoRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.Request[T](TransportRequestState`1 requestState, Object data)
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.Sniff(ITransportRequestState ownerState)
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.Sniff(ITransportRequestState ownerState)
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.SniffClusterState(ITransportRequestState requestState)
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.SniffOnConnectionFailure(ITransportRequestState requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.RetryRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.DoRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.Request[T](TransportRequestState`1 requestState, Object data)
   at Elasticsearch.Net.Connection.Transport.DoRequest[T](String method, String path, Object data, IRequestParameters requestParameters)
   at Elasticsearch.Net.ElasticsearchClient.DoRequest[T](String method, String path, Object data, IRequestParameters requestParameters)
   at Elasticsearch.Net.ElasticsearchClient.IndicesCreatePost[T](String index, Object body, Func`2 requestParameters)
   at Nest.RawDispatch.IndicesCreateDispatch[T](ElasticsearchPathInfo`1 pathInfo, Object body)
   at Nest.ElasticClient.<CreateIndex>b__281_0(ElasticsearchPathInfo`1 p, ICreateIndexRequest d)
   at Nest.ElasticClient.Nest.IHighLevelToLowLevelDispatcher.Dispatch[D,Q,R](D descriptor, Func`3 dispatch)
   at Nest.ElasticClient.CreateIndex(Func`2 createIndexSelector)
   at DCSCache.esvRepository.CreateIndex(String IndexName, String IndexVersion)
   at DCSCache.esvRepository.Save(esv ItemToSave, String IndexName, String IndexVersion)

  • What version of NEST are you using? What errors or exceptions are being thrown - do you have some example stack traces? What does the mapping for the documents that you are indexing look like? The Bulk API would probably be a better fit for this - each operation in the Bulk request is independent of the others, so you can retry those that fail in the one request. Using something like TPL Dataflow with a buffering queue and timeout to bulk index requests may work well. – Russ Cam Nov 23 '15 at 06:45
  • Thanks Russ, looking into the Bulk API a bit more. I didn't think it would work for my scenario, but after re-reading the docs I think it might be my answer. – user2033791 Nov 23 '15 at 15:03
  • The stack trace is for max retries when getting info about the nodes in the cluster, although I expect you see max retry exceptions for other requests too. A few things that might be worth considering: (1) Increase the `Timeout` from 5 seconds to something higher, say 20-30 seconds (the default for this is 60 seconds, IIRC). (2) Consider enabling HTTP compression with the HTTP module (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-http.html). (3) Have a look at `.SetMaxRetryTimeout()` in conjunction with `.MaximumRetries()`, setting a timeout on retries. – Russ Cam Nov 23 '15 at 21:46
  • All excellent suggestions. I will look into all of them. I did get the bulk API to insert 15,000 docs yesterday in a minute without any errors--previously the best I could do was 7,000 a minute. Our actual needs are far less than that on a normal operating basis but I'm trying to get a feel for our limits before we take the system live. Thanks again for your help. – user2033791 Nov 24 '15 at 11:25
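For reference, here is a rough sketch of the TPL Dataflow + Bulk approach suggested in the comments; the batch size, flush interval, and error handling are placeholders rather than tuned values, and `esv` is the document class from the question:

using System;
using System.Threading;
using System.Threading.Tasks.Dataflow; // TPL Dataflow NuGet package

// Single-document changes from SQL are posted to a BatchBlock, gathered into
// groups of up to 1000, and each group is sent to Elasticsearch as one bulk
// request instead of 1000 individual index calls.
var batchBlock = new BatchBlock<esv>(1000);

var indexBlock = new ActionBlock<esv[]>(batch =>
{
    // IndexMany issues a single _bulk request for the whole batch
    var bulkResponse = client.IndexMany(batch, "dummyindex");
    if (!bulkResponse.IsValid)
    {
        // each item in a bulk request succeeds or fails independently,
        // so inspect the response and retry/log only the failed operations
    }
});

batchBlock.LinkTo(indexBlock, new DataflowLinkOptions { PropagateCompletion = true });

// flush a partial batch every couple of seconds so quiet periods still get indexed
var flushTimer = new Timer(_ => batchBlock.TriggerBatch(), null,
    TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2));

// producer side: post each change as it arrives from SQL
// batchBlock.Post(changedDocument);

// shutdown: batchBlock.Complete(); then indexBlock.Completion.Wait();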
