We have set up Usergrid 2.1.0 with Elasticsearch 1.7.5 and Cassandra 3.7 on a very large system: 12 machines for Usergrid, 9 for Cassandra, and 9 for Elasticsearch. All machines are virtual, each with 16 cores and 32 GB of RAM. However, even at 3,000 concurrent users, the Elasticsearch and Cassandra servers hit 100% CPU usage. When Elasticsearch CPU peaks, we cannot read the /roles collection, so users cannot log in. When Cassandra CPU peaks, Usergrid cannot connect to Cassandra and simply stops responding to all HTTP requests.

There is no I/O wait on disk or network.

Our application depends on Usergrid queries, so we generate a heavy query load. Still, I did not expect such CPU peaks on the subsystems.

Any support is appreciated.

Eren Yilmaz

1 Answer

It took almost 10 days, and the solution came the hard way. Lessons learned for Elasticsearch:

  1. Never, ever use G1GC on Elasticsearch (at least until it becomes the default collector)!
  2. Avoid using "contains" queries from Usergrid at all costs.
  3. Always listen to the recommendations.
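For the first point, a quick way to confirm which collector each node is actually running is to inspect the JVM section of the Elasticsearch nodes info API (`GET /_nodes/jvm`), which reports a `gc_collectors` list per node. The following is only a minimal sketch: the function name and the sample response are made up for illustration, and the sample is trimmed to the fields the check reads.

```python
def nodes_using_g1(nodes_info):
    """Return the names of nodes whose reported GC collectors include G1.

    `nodes_info` is the parsed JSON body of GET /_nodes/jvm.
    """
    flagged = []
    for node_id, node in nodes_info.get("nodes", {}).items():
        collectors = node.get("jvm", {}).get("gc_collectors", [])
        # G1 shows up as "G1 Young Generation" / "G1 Old Generation";
        # the CMS setup reports "ParNew" / "ConcurrentMarkSweep".
        if any(name.startswith("G1") for name in collectors):
            flagged.append(node.get("name", node_id))
    return sorted(flagged)


# Hypothetical, trimmed response used only to demonstrate the check.
sample = {
    "nodes": {
        "a1": {"name": "es-node-1",
               "jvm": {"gc_collectors": ["ParNew", "ConcurrentMarkSweep"]}},
        "b2": {"name": "es-node-2",
               "jvm": {"gc_collectors": ["G1 Young Generation",
                                         "G1 Old Generation"]}},
    }
}

print(nodes_using_g1(sample))
```

Any node this reports is worth checking for a `-XX:+UseG1GC` flag in its JVM options.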

We are still having problems with Cassandra-Usergrid communication. Whenever a node goes down (maintenance, update, etc.), the Usergrid clients log connection errors and, after about 15 retries, stop all communication.

Eren Yilmaz