
I have a 4-node Elasticsearch cluster and a .NET console application that is designed to fill the cluster with data that comes from SQL. Everything works fine as long as I keep the rate of records being added (or deleted) fairly low. If I increase the number of threads, I eventually see timeout errors from my console app. The cluster has a total of 48 cores, and the average time it takes to index a record is about 0.1 seconds.

I have been able to get it to do about 7,000 records (documents) per second. I never see any exceptions thrown from Elasticsearch.Net that indicate low resources, I never see any of the indexing queues overloaded, and the servers never peak at more than about 10% CPU. It looks like the issue is not the cluster or its configuration but something in the NEST connection. Here is my code for the connection:

//set up the es client
Uri node = new Uri(ConfigurationManager.AppSettings["ESConnectionString"]);
var connectionPool = new SniffingConnectionPool(new[] { node });
ConnectionSettings settings = new ConnectionSettings(connectionPool);
settings.SetDefaultPropertyNameInferrer(p => p); //ditch the camelcase
settings.SniffOnConnectionFault(true);
settings.SniffOnStartup(true);
settings.SniffLifeSpan(TimeSpan.FromMinutes(1));
settings.SetPingTimeout(3000);
settings.SetTimeout(5000);
settings.MaximumRetries(5);
//settings.SetMaximumAsyncConnections(20);
settings.SetDefaultIndex("dummyindex");
settings.SetBasicAuthentication(ConfigurationManager.AppSettings["ESUser"], ConfigurationManager.AppSettings["ESPass"]);
ElasticClient client = new ElasticClient(settings);

I have the cluster set up with http.basic authentication, but I have tried with it turned on and off and there is no difference. Here are some of the pertinent settings from the ES nodes:

discovery.zen.minimum_master_nodes: 2
discovery.zen.fd.ping_timeout: 30s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["CACHE01","CACHE02","CACHE03","CACHE04"]
cluster.routing.allocation.node_concurrent_recoveries: 5
indices.recovery.max_bytes_per_sec: 50mb
http.basic.enabled: true
http.basic.user: "admin"
http.basic.password: "XXXXXXX"

At this point I can't figure out whether the issue is the .NET client or the servers. Everything points to the client, but I'm at a loss for what to try next. I don't think I can use the Bulk API, because I'm essentially just replicating changes from a SQL server, and in order to keep the two in sync I execute each change as soon as it's received. It also seems that I can insert new documents at a much faster pace than I can update existing ones. I have read the update docs, and they almost read as if partial updates are better than full updates, but there is still the whole get-update-delete-reindex cycle that seems to happen with every update.
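For reference, a single partial update with the NEST 1.x syntax used above would look roughly like the sketch below; `changedId`, `newPrice`, and the `Price` field are placeholders for whatever actually changes in SQL, and the `esv` type is the document class from my repository:

// Rough sketch of a partial update (placeholder field names).
// Only the changed fields go over the wire; Elasticsearch still performs the
// internal get + reindex, but the request payload stays small.
var updateResponse = client.Update<esv, object>(u => u
    .Index("dummyindex")
    .Id(changedId)                    // _id of the document to patch
    .Doc(new { Price = newPrice }));  // only the fields that changed

if (!updateResponse.IsValid)
{
    // inspect the response / log the failure for diagnostics
}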

According to the ES docs I'm not supposed to tweak the thread pools or the other performance settings. I don't think I'm hitting any of those limits anyway, and the ES error logs don't indicate any issues either.

Anyone have advice on what I can do to track down the connection errors?

UPDATE: This is the actual error:

Error: Unexpected result (SaveToES).
Elasticsearch.Net.Exceptions.MaxRetryException: Sniffing known nodes in the cluster caused a maxretry exception of its own
 ---> Elasticsearch.Net.Exceptions.SniffException: Sniffing known nodes in the cluster caused a maxretry exception of its own
 ---> Elasticsearch.Net.Exceptions.MaxRetryException: Retry timeout 00:00:05 was hit after retrying 1 times: 'GET _nodes/_all/clear?timeout=3000'.
InnerException: WebException, InnerMessage: The operation has timed out, InnerStackTrace:
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.Connection.HttpConnection.DoSynchronousRequest(HttpWebRequest request, Byte[] data, IRequestConfiguration requestSpecificConfig)
InnerException: WebException, InnerMessage: The operation has timed out, InnerStackTrace:
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.Connection.HttpConnection.DoSynchronousRequest(HttpWebRequest request, Byte[] data, IRequestConfiguration requestSpecificConfig)
 ---> System.AggregateException: One or more errors occurred.
 ---> System.Net.WebException: The operation has timed out
   at System.Net.HttpWebRequest.GetResponse()
   at Elasticsearch.Net.Connection.HttpConnection.DoSynchronousRequest(HttpWebRequest request, Byte[] data, IRequestConfiguration requestSpecificConfig)
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandlerBase.ThrowMaxRetryExceptionWhenNeeded[T](TransportRequestState`1 requestState, Int32 maxRetries)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.RetryRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.DoRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.RetryRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.DoRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.Request[T](TransportRequestState`1 requestState, Object data)
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.Sniff(ITransportRequestState ownerState)
   --- End of inner exception stack trace ---
   --- End of inner exception stack trace ---
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.Sniff(ITransportRequestState ownerState)
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.SniffClusterState(ITransportRequestState requestState)
   at Elasticsearch.Net.Connection.Transport.Elasticsearch.Net.Connection.ITransportDelegator.SniffOnConnectionFailure(ITransportRequestState requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.RetryRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.DoRequest[T](TransportRequestState`1 requestState)
   at Elasticsearch.Net.Connection.RequestHandlers.RequestHandler.Request[T](TransportRequestState`1 requestState, Object data)
   at Elasticsearch.Net.Connection.Transport.DoRequest[T](String method, String path, Object data, IRequestParameters requestParameters)
   at Elasticsearch.Net.ElasticsearchClient.DoRequest[T](String method, String path, Object data, IRequestParameters requestParameters)
   at Elasticsearch.Net.ElasticsearchClient.IndicesCreatePost[T](String index, Object body, Func`2 requestParameters)
   at Nest.RawDispatch.IndicesCreateDispatch[T](ElasticsearchPathInfo`1 pathInfo, Object body)
   at Nest.ElasticClient.<CreateIndex>b__281_0(ElasticsearchPathInfo`1 p, ICreateIndexRequest d)
   at Nest.ElasticClient.Nest.IHighLevelToLowLevelDispatcher.Dispatch[D,Q,R](D descriptor, Func`3 dispatch)
   at Nest.ElasticClient.CreateIndex(Func`2 createIndexSelector)
   at DCSCache.esvRepository.CreateIndex(String IndexName, String IndexVersion)
   at DCSCache.esvRepository.Save(esv ItemToSave, String IndexName, String IndexVersion)

  • What version of NEST are you using? What errors or exceptions are being thrown - do you have some example stack traces? What does the mapping for the documents that you are indexing look like? The Bulk API would probably be a better fit for this - each operation in the Bulk request is independent of the others, so you can retry those that fail in the one request. Using something like TPL Dataflow with a buffering queue and timeout to bulk index requests may work well. – Russ Cam Nov 23 '15 at 06:45
  • Thanks Russ, looking into the Bulk API a bit more. I didn't think it would work for my scenario, but after re-reading the docs I think it might be my answer. – user2033791 Nov 23 '15 at 15:03
  • The stack trace is for max retries when getting info about the nodes in the cluster, although I expect you see max retry exceptions for other requests too. A few things that might be worth considering: (1) Increase the `Timeout` from 5 seconds to something higher, say 20-30 seconds (the default for this is 60 seconds, IIRC). (2) Consider enabling HTTP compression with the HTTP module (https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-http.html). (3) Have a look at `.SetMaxRetryTimeout()` in conjunction with `.MaximumRetries()`, setting a timeout on retries. – Russ Cam Nov 23 '15 at 21:46
  • All excellent suggestions. I will look into all of them. I did get the bulk API to insert 15,000 docs yesterday in a minute without any errors--previously the best I could do was 7,000 a minute. Our actual needs are far less than that on a normal operating basis but I'm trying to get a feel for our limits before we take the system live. Thanks again for your help. – user2033791 Nov 24 '15 at 11:25
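For reference, here is a rough sketch of the TPL Dataflow + Bulk approach suggested in the comments; the batch size, flush interval, and error handling are placeholders rather than tuned values, and `esv` is the document class from the question:

using System;
using System.Threading;
using System.Threading.Tasks.Dataflow; // TPL Dataflow NuGet package

// Single-document changes from SQL are posted to a BatchBlock, gathered into
// groups of up to 1000, and each group is sent to Elasticsearch as one bulk
// request instead of 1000 individual index calls.
var batchBlock = new BatchBlock<esv>(1000);

var indexBlock = new ActionBlock<esv[]>(batch =>
{
    // IndexMany issues a single _bulk request for the whole batch
    var bulkResponse = client.IndexMany(batch, "dummyindex");
    if (!bulkResponse.IsValid)
    {
        // each item in a bulk request succeeds or fails independently,
        // so inspect the response and retry/log only the failed operations
    }
});

batchBlock.LinkTo(indexBlock, new DataflowLinkOptions { PropagateCompletion = true });

// flush a partial batch every couple of seconds so quiet periods still get indexed
var flushTimer = new Timer(_ => batchBlock.TriggerBatch(), null,
    TimeSpan.FromSeconds(2), TimeSpan.FromSeconds(2));

// producer side: post each change as it arrives from SQL
// batchBlock.Post(changedDocument);

// shutdown: batchBlock.Complete(); then indexBlock.Completion.Wait();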
