Extremely high CPU load on ES and Cassandra

Question

We have set up Usergrid (2.1.0) with Elasticsearch 1.7.5 and Cassandra 3.7 on a very large system: 12 machines for Usergrid, 9 for Cassandra and 9 for Elasticsearch. All (virtual) machines have 16 cores and 32 GB of RAM. However, even at 3000 concurrent users, the ES and Cassandra servers hit 100% CPU usage. When the ES CPU peaks, we cannot get the /roles collection, so users cannot log in. When the Cassandra CPU peaks, Usergrid cannot connect to Cassandra and simply stops responding to all HTTP requests.

There is no I/O wait on disk or network.

Our application depends on Usergrid queries, so we issue a heavy query load. Still, I did not expect such CPU peaks on the subsystems.

Any support is appreciated.

score 0 · Answer 1 · answered Dec 16 '16 at 08:36

It took almost 10 days, and the solution came the hard way. Lessons learned, for Elasticsearch:

Never, ever use G1GC on Elasticsearch! (until it becomes default)
Avoid using "contains" queries from usergrid at all costs.
Always listen to the recommendations.
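
For anyone who wants to check which collector their ES nodes are actually running, here is a minimal sketch. It assumes the nodes info API (GET /_nodes/jvm) returns a "gc_collectors" list per node under nodes.<id>.jvm, and that G1 collectors are reported with names starting with "G1" (e.g. "G1 Young Generation"). The function names and the localhost URL are made up for illustration, so adjust them to your cluster.

```python
import json
from urllib.request import urlopen


def nodes_using_g1(nodes_info):
    """Return the sorted names of nodes whose JVM reports a G1 collector.

    nodes_info is the parsed JSON body of GET /_nodes/jvm (assumed shape:
    {"nodes": {node_id: {"name": ..., "jvm": {"gc_collectors": [...]}}}}).
    """
    flagged = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        collectors = node.get("jvm", {}).get("gc_collectors", [])
        # Assumption: G1 collectors are named "G1 Young Generation" / "G1 Old Generation".
        if any(name.startswith("G1") for name in collectors):
            flagged.append(node.get("name", node_id))
    return sorted(flagged)


def fetch_nodes_info(base_url):
    """Fetch and parse the JVM section of the nodes info API from one ES node."""
    with urlopen(base_url.rstrip("/") + "/_nodes/jvm") as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would be something like nodes_using_g1(fetch_nodes_info("http://localhost:9200")); an empty list means no node reported a G1 collector.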

We are still having problems with Cassandra-Usergrid communication. Whenever a node goes down (maintenance, update, etc.), the Usergrid clients print connection errors, and after about 15 tries they stop communicating altogether.
