ES 1.7.x on CentOS

Our production ES cluster went down hard. We lost the entire index. Turns out, this had been in the logs for a day or more:

New used memory from field ### would be larger than configured breaker

OK.

What URL on ES can I hit to see that problems like this are occurring? (Log monitoring is not part of our monitoring regime, but hitting an ES URL is easy for us.)

We use the cluster health URLs now, so we see the cluster go yellow/red, but so far we have not found a way to see problems coming from the outside, so we get clobbered.

Jonesome Reinstate Monica
  • Have you read the ES docs? What URLs have you tried? – GregL Nov 09 '15 at 22:17
  • @GregL I have read a lot of ES docs, but not all (I guess). I have not found docs (yet) on how to see if breakers are being tripped. (Ergo my question.) OP enhanced. – Jonesome Reinstate Monica Nov 09 '15 at 22:24
  • There's no URL that will explicitly list tripped breakers, but there are counters in one of the stats, status or health pages for each breaker type, which give the *number* of times it has been tripped. If you're relatively savvy you can monitor those values, and when the frequency of trips goes above a given level, your monitoring system can throw an alert. – GregL Nov 10 '15 at 00:52
  • @GregL Good point. It is a drag to do the math (well, it is more than a simple monitor), but possible. We will look into this. – Jonesome Reinstate Monica Nov 10 '15 at 03:34
  • @GregL OK, thanks to your inspiration, I documented the answer below. It is an answer, but it is not pretty. – Jonesome Reinstate Monica Nov 10 '15 at 03:48

1 Answer

OK, found the answer.

Frankly, it is a weak answer that puts a real burden on us.

As documented here:

https://www.elastic.co/guide/en/elasticsearch/reference/1.4/cluster-nodes-stats.html

Use this:

curl -XGET 'http://localhost:9200/_nodes/stats?pretty=true'

In the response, each node has a breakers section, and each breaker in it has a tripped element.

That is just a counter, not a velocity. So you have to:

  • Write your own code to read the value
  • Wait N time
  • Read again
  • Do math
  • Surface breakers tripped/min
  • Figure out what a problem threshold is for you
  • Monitor against that

It would be so very nice if ES could work out the velocity, so we could just focus on those last two points.

But this is the best there is, from what I can see so far.
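
The polling steps listed above could be sketched roughly as follows. This is a minimal illustration, not a tested monitoring plugin: it assumes the node stats response is laid out as nodes, then node ID, then breakers, then breaker name, then tripped, and the names POLL_SECONDS, ALERT_TRIPS_PER_MIN, tripped_counts, trips_per_minute and monitor are all made up for this example. The threshold value is a placeholder you would have to tune.

```python
import json
import time
import urllib.request

# Host and port are taken from the curl command above.
STATS_URL = "http://localhost:9200/_nodes/stats"
POLL_SECONDS = 60          # hypothetical polling interval
ALERT_TRIPS_PER_MIN = 1.0  # hypothetical threshold; tune for your cluster


def fetch_stats(url=STATS_URL):
    """Fetch and parse the node stats JSON."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def tripped_counts(stats):
    """Return {(node_id, breaker_name): tripped} from a node stats document.

    Assumes the layout nodes -> <node id> -> breakers -> <name> -> tripped.
    """
    counts = {}
    for node_id, node in stats.get("nodes", {}).items():
        for name, breaker in node.get("breakers", {}).items():
            counts[(node_id, name)] = breaker.get("tripped", 0)
    return counts


def trips_per_minute(prev, curr, elapsed_seconds):
    """Rate of new trips between two samples, summed over all breakers.

    A counter that is lower than before is treated as having restarted
    from zero; a breaker not seen in the previous sample counts from zero.
    """
    new_trips = 0
    for key, value in curr.items():
        before = prev.get(key, 0)
        new_trips += value - before if value >= before else value
    return new_trips * 60.0 / elapsed_seconds


def monitor(poll_seconds=POLL_SECONDS, threshold=ALERT_TRIPS_PER_MIN):
    """Poll forever and print a line whenever the trip rate hits the threshold."""
    prev, prev_time = tripped_counts(fetch_stats()), time.monotonic()
    while True:
        time.sleep(poll_seconds)
        curr, now = tripped_counts(fetch_stats()), time.monotonic()
        rate = trips_per_minute(prev, curr, now - prev_time)
        if rate >= threshold:
            print("ALERT: %.2f breaker trips/min" % rate)
        prev, prev_time = curr, now
```

Calling monitor() starts the loop. In practice the print would be replaced by whatever hook your monitoring system offers, and many monitoring tools can compute the rate from the raw counter themselves, in which case only tripped_counts is needed.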

Jonesome Reinstate Monica
  • Yup, that's the way you do it. To be fair though, almost all metrics you find (everywhere, not just ES) are either gauges (CPU, memory) or counters (network traffic, HTTP requests, breakers tripped), and providing velocities is hard since each person's timeframe is different (some want per second, others per minute). I'd be surprised if your monitoring solution didn't provide a way to do the math for you based on the polling interval and the counters retrieved. – GregL Nov 10 '15 at 04:59