I have a 15-node Elasticsearch cluster and am indexing a lot of documents. The documents are of the form { "message": "some sentences" }. When I had a 9-node cluster, I could get CPU utilization up to 80% on all of the nodes. When I turned it into a 15-node cluster, I get 90% CPU usage on 4 nodes and only ~50% on the rest.

The specification of the cluster is:

15 nodes, c4.2xlarge EC2 instances

15 shards, no replicas

There is a load balancer in front of all the instances, and the instances are accessed through the load balancer.

Marvel is running and is used to monitor the cluster

Refresh interval 1s

I could index 50k docs/sec on 9 nodes and only 70k docs/sec on 15 nodes. Shouldn't I be able to do more?
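
As a quick sanity check on the numbers above, the per-node throughput can be worked out directly. This is only arithmetic on the figures quoted in the question; the variable names are illustrative.

```python
# Figures quoted in the question.
nodes_before, rate_before = 9, 50_000   # docs/sec on 9 nodes
nodes_after, rate_after = 15, 70_000    # docs/sec on 15 nodes

per_node_before = rate_before / nodes_before   # docs/sec per node on 9 nodes
per_node_after = rate_after / nodes_after      # docs/sec per node on 15 nodes

# Throughput expected on 15 nodes if scaling were perfectly linear.
linear_expectation = per_node_before * nodes_after

# Fraction of the linear expectation that was actually achieved.
scaling_efficiency = rate_after / linear_expectation

print(f"per node before: {per_node_before:.0f} docs/sec")
print(f"per node after:  {per_node_after:.0f} docs/sec")
print(f"linear expectation on 15 nodes: {linear_expectation:.0f} docs/sec")
print(f"scaling efficiency: {scaling_efficiency:.0%}")
```

So the 15-node cluster reaches roughly 84% of what perfectly linear scaling from the 9-node result would give (about 83k docs/sec).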

1 Answer

I'm not yet an expert on scalability and load balancing in ES, but here are some things to consider:

  • Load balancing is built into ES, so putting a load balancer in front of the cluster can actually undermine the native load balancing. It's a bit like having a speed limiter on your car but braking manually anyway: the limiter should already do the job, and your manual regulation prevents it from doing it properly. Have you tried dropping your load balancer and relying only on the native load balancing to see how it fares?
  • more servers and shards give you more CPU / computation power, but they also add overhead: a search has to visit every shard, and a bulk write is split up and forwarded to each shard it touches. So if 1 shard can do N computations, M shards won't actually be able to do M*N computations
  • having 15 shards is probably overkill in a lot of cases
  • having 15 shards but no replication is risky: if any one of your 15 servers fails, the shard it holds becomes unavailable and you can no longer access your whole index
  • you can actually hold multiple nodes on a single server
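
To illustrate the overhead point in the list above: a bulk request is split by target shard, so the more shards there are, the more (and smaller) sub-requests each bulk turns into. This is a minimal sketch, assuming documents are routed as hash(id) modulo the number of shards. `zlib.crc32` is a stand-in for Elasticsearch's internal hash function, and `shard_for` and `split_bulk` are hypothetical helpers, not part of any Elasticsearch API.

```python
import zlib
from collections import defaultdict

def shard_for(doc_id: str, num_shards: int) -> int:
    # Stand-in hash: Elasticsearch uses its own internal hash function.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def split_bulk(doc_ids, num_shards):
    """Group the document ids of one bulk request by target shard."""
    per_shard = defaultdict(list)
    for doc_id in doc_ids:
        per_shard[shard_for(doc_id, num_shards)].append(doc_id)
    return dict(per_shard)

# One simulated bulk request of 1000 documents with made-up ids.
bulk = [f"doc-{i}" for i in range(1000)]
for shards in (1, 9, 15):
    groups = split_bulk(bulk, shards)
    sizes = [len(ids) for ids in groups.values()]
    print(shards, "shards ->", len(groups), "sub-requests, largest", max(sizes))
```

The exact split depends on the hash, but the shape is the same: the node that receives the bulk request has to forward a slice of it to every shard involved and wait for all of them.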

What is your index size in terms of storage?

  • The above setup is not used in production; it is for benchmarking. Using Elasticsearch's built-in load balancing makes it difficult to replace failed nodes automatically. If the server that the client first connects to fails, we need to change the client code to point to the new server. I have tried indexing the documents into 2 indexes of 7 shards each, and the results are similar. So maybe the problem is not too many shards. – Sacheendra Talluri May 20 '15 at 16:23
  • I don't think what you say is true. ElasticSearch is supposed to have a master node to "rule them all" and to create a new entry point into the cluster when a client node fails, also pointing to its older data via other shards when you have replication. I don't think you'd need to recompile anything in order to connect to your "new server", since it is in fact the same cluster with one node less, which should be transparent. See this link: http://stackoverflow.com/questions/24751025/is-using-a-load-balancer-with-elasticsearch-unnecessary – Christophe Schutz May 21 '15 at 08:06
  • Consider the case where the node specified in the application is down: the newly started application would not be able to connect to it, and thus would not be able to discover the cluster, so the application would be down. With that approach, the application would need either a hardcoded set of servers to connect to or a service discovery mechanism for initial discovery. I am talking about the case where the application is newly started and the node we have specified is down. – Sacheendra Talluri May 22 '15 at 15:47
  • Nope, ElasticSearch automatically assigns a new master & client node, even if the "next-in-line" is down. Obviously, if all of your servers are down, it won't work, but nothing can prevent that... – Christophe Schutz May 26 '15 at 07:17
  • A new master is automatically elected, that is true. But in the application we have specified a node to connect to. If that node is not available when the application starts, the application cannot know at which address the master is, because there has been no initial communication with the cluster. – Sacheendra Talluri May 26 '15 at 19:37
  • The load balancing / coping with failure works just as well with client nodes. Please read this link : https://blog.liip.ch/archive/2013/07/19/on-elasticsearch-performance.html – Christophe Schutz May 27 '15 at 08:44
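
The startup concern raised in these comments (a freshly started application whose single configured node is down) can be sketched as follows. This is only an illustration of the "hardcoded set of servers" option mentioned above, not an Elasticsearch client: `first_reachable` and the `is_up` probe are hypothetical, and the addresses are made up.

```python
from typing import Callable, Iterable, Optional

def first_reachable(seed_hosts: Iterable[str],
                    is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first seed host for which is_up() is true, else None."""
    for host in seed_hosts:
        if is_up(host):
            return host
    return None

# Hypothetical addresses; the first one is simulated as being down.
seeds = ["10.0.0.1:9200", "10.0.0.2:9200", "10.0.0.3:9200"]
down = {"10.0.0.1:9200"}

entry_point = first_reachable(seeds, lambda host: host not in down)
print(entry_point)  # prints 10.0.0.2:9200
```

With a single configured seed, the same call returns `None` when that node is down, which is the failure mode described above. A list of seeds, a load balancer, or a service discovery mechanism are different ways of avoiding it.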