Distributed Metrics

Question

I have been working on a single-box application that uses Codahale Metrics heavily for instrumentation. We are now moving to the cloud, and I have the following questions about how to monitor metrics once the application is distributed.

1. Is there a metrics reporter that can write metrics data to Cassandra?
2. When and how does the aggregation happen if there are records per server in the database?
3. Can I define the time interval at which the metrics data gets saved into the database?
4. Are there any inbuilt frameworks that are available to achieve this?

Thanks a bunch and appreciate all your help.

pandaadb · Answer 1 · 2016-07-06T09:08:00.593

I am answering your questions first, but I think you are misunderstanding how to use Metrics.

1. You can google this fairly easily. I don't know of any (I also don't understand what you would do with it in Cassandra). You would normally use something like Graphite for that. In any case, a reporter implementation is very straightforward.
2. That question does not make much sense. Why would you aggregate over two different servers? They are independent. Each of your monitored instances should be standalone. Aggregation happens on the receiving side (e.g. Graphite).
3. You can; see 1. Write a reporter and configure it accordingly.
4. Not that I know of.

Now to metrics in general:

I think you have the wrong idea. You can monitor X servers, that is not a problem at all, but you should not aggregate on the client side (or database side). How would that even work? Restarts zero the clients, which means you would need to track the state of each of your servers for your aggregation to work. How do you manage outages?

The way you should monitor your servers with metrics:

Create a namespace.

io.my.server.{hostname}.my.metric

Now you have X different namespaces, but they all share a common prefix. That means you have grouped them.
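A minimal sketch of how such per-host names could be built, using only the Java standard library. The class `MetricNames` and its methods are hypothetical names chosen for illustration, not part of the Metrics library. Dots in the hostname are replaced here on the assumption that the receiving side (e.g. Graphite) treats dots as path separators.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper: builds metric names of the form prefix.hostname.metric
final class MetricNames {
    private MetricNames() {}

    // Replace anything that is not a letter, digit, underscore or hyphen,
    // so a fully qualified hostname stays a single path segment.
    static String sanitize(String hostname) {
        return hostname.replaceAll("[^A-Za-z0-9_-]", "_");
    }

    // e.g. forHost("io.my.server", "web-01", "my.metric")
    static String forHost(String prefix, String hostname, String metric) {
        return prefix + "." + sanitize(hostname) + "." + metric;
    }

    // Best-effort lookup of the local hostname; falls back to a placeholder.
    static String localHostname() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }
}
```

Each server would call something like `MetricNames.forHost("io.my.server", MetricNames.localHostname(), "my.metric")` once at startup and register its metrics under that name.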

Send them to your preferred monitoring solution.

There are heaps out there. I do not understand why you want this to be Cassandra. What kind of advantage do you gain from that? http://graphite.wikidot.com/ for example is a graphing solution. Your applications can automatically submit data there (there is a Graphite reporter for Java that you can use). See http://graphite.wikidot.com/screen-shots for what it looks like.

The main point is that Graphite (and all or most providers) knows how to handle your namespaces. Also look at Zabbix, for example, which can do the same thing.
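To make "submitting data" concrete, here is a rough sketch of what goes over the wire with Graphite's plaintext protocol, where each data point is one line of the form `path value timestamp`. In practice a ready-made reporter would normally do this for you; the class `GraphiteLine`, the host and the port passed to `send` are assumptions for illustration (Carbon's plaintext listener defaults to port 2003).

```java
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper showing the Graphite plaintext line format.
final class GraphiteLine {
    private GraphiteLine() {}

    // One data point: "<metric path> <value> <unix timestamp in seconds>\n"
    static String format(String path, long value, long epochSeconds) {
        return path + " " + value + " " + epochSeconds + "\n";
    }

    // Opens a TCP connection, writes one line and closes the connection.
    // A real reporter would batch lines and reuse the connection.
    static void send(String host, int port, String line) throws IOException {
        try (Socket socket = new Socket(host, port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(line);
            out.flush();
        }
    }
}
```

Because the hostname is part of the metric path, every server can send its own lines independently and the receiving side still sees them grouped under the common prefix.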

Aggregations

Now the aggregation happens on the receiving side. Your provider knows how to do that, and you can define rules.

For example, you could wildcard alerts like:

io.my.server.{hostname}.my.metric.count > X

Graphite (I believe) even supports operations, e.g.:

sum(io.my.server.{hostname}.my.metric.request) - which would sum up the requests of ALL your hosts

That is where the aggregation happens. At that point, your servers are again standalone (as they should be), and have no dependency on each other or on any monitoring database etc. They simply report their own metrics (which is what they should do) and you, as the consumer of those metrics, are responsible for setting up the right alerts/aggregations/formulas on the receiving end.
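As an illustration only, the receiving-side aggregation can be pictured like this: the monitoring system holds the latest value per metric name and sums every series whose name matches a wildcard pattern. This is a simplified model, not how Graphite is implemented; `WildcardSum` is a hypothetical class, and the `*` wildcard matching exactly one dot-separated segment mirrors how Graphite path wildcards are commonly written (its own function for this is `sumSeries`).

```java
import java.util.Map;

// Hypothetical model of a wildcard sum done by the metrics receiver.
final class WildcardSum {
    private WildcardSum() {}

    // True if every segment of the name equals the pattern segment,
    // where "*" in the pattern matches any single segment.
    static boolean matches(String pattern, String name) {
        String[] p = pattern.split("\\.");
        String[] n = name.split("\\.");
        if (p.length != n.length) {
            return false;
        }
        for (int i = 0; i < p.length; i++) {
            if (!p[i].equals("*") && !p[i].equals(n[i])) {
                return false;
            }
        }
        return true;
    }

    // Sums the latest values of all series matching the pattern.
    static long sum(String pattern, Map<String, Long> latestValues) {
        long total = 0;
        for (Map.Entry<String, Long> entry : latestValues.entrySet()) {
            if (matches(pattern, entry.getKey())) {
                total += entry.getValue();
            }
        }
        return total;
    }
}
```

With values reported by `web-01` and `web-02`, the pattern `io.my.server.*.my.metric.request` adds up the request counts of both hosts without either host knowing about the other.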

Aggregating this on the server side would involve:

Discover all other servers
Monitor their state
Receive/send metrics back and forth
Synchronise what they report etc

That just sounds like a nightmare for maintenance :) I hope that gives you some insight/ideas.

(Disclaimer: Neither a Metrics dev nor a Graphite dev. This is just how I did this in the past, and the approach I still use.)

Edit:

With your comment in mind, here are my two fave solutions for what you want to achieve:

DB

You can use the DB and store dates, e.g. for the start message and the end message. This is not really a metrics thing, so maybe not preferred. As per your question you could write your own reporter for that, but it would get complicated with regard to upserts/updates etc. I think option 2 is easier and has more potential.

Logs

This is, I think, what you need. Your servers independently log on Start/Stop/Pause etc., whatever it is you want to report on. You then set up Logstash and collect those logs. Logstash allows you to track these events over time and create metrics from them, see:

https://www.elastic.co/guide/en/logstash/current/plugins-filters-metrics.html

Or:

https://github.com/logstash-plugins/logstash-filter-elapsed

The first one uses actual metrics. The second one is a different plugin that just measures times between start/stop events.

This is the option with the most potential because it does not rely on any particular format, data store or anything else. You even get Kibana for plotting out of the box if you use the entire ELK stack.

Say you wanted to measure your messages. You can just look at the logs; there are no application changes involved. The solution does not even touch your application, so it is a completely standalone solution. By contrast, storing your reporting data manually takes up threads and processing time in your application, so if you need to be real-time capable it will drag your overall performance down. Later on, when you want to measure other things, you can easily extend your Logstash configuration and start producing other metrics.
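The idea behind measuring the time between start and stop events can be sketched in plain Java as follows. This is not the Logstash plugin's code, just the same pairing logic under assumed inputs: each log event carries a timestamp in milliseconds, the host that wrote it, an event type (`START` or `END`) and a message id. All of these field names, and the class `ElapsedTracker`, are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker that pairs START/END log events by message id.
final class ElapsedTracker {

    // Result for one message: which host finished it and how long it took.
    static final class Elapsed {
        final String messageId;
        final String host;
        final long millis;

        Elapsed(String messageId, String host, long millis) {
            this.messageId = messageId;
            this.host = host;
            this.millis = millis;
        }
    }

    private final Map<String, Long> startTimes = new HashMap<>();
    private final List<Elapsed> completed = new ArrayList<>();

    // Feed events in the order they were logged.
    void onEvent(long timestampMillis, String host, String type, String messageId) {
        if ("START".equals(type)) {
            startTimes.put(messageId, timestampMillis);
        } else if ("END".equals(type)) {
            Long start = startTimes.remove(messageId);
            if (start != null) { // ignore END events without a matching START
                completed.add(new Elapsed(messageId, host, timestampMillis - start));
            }
        }
    }

    List<Elapsed> completed() {
        return completed;
    }
}
```

Since every result keeps the host and the message id, this kind of log-based approach can answer both "which server processed which message" and "how long did it take" without the servers coordinating with each other.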

I hope this helps

Thanks for your response. I'm looking at application-level instrumentation rather than system-level metrics. Let's say I want to capture how long the system took to process a single incoming message. If there is a cluster of servers, I wouldn't be able to find out which server processed which message and how long it took. Overall, for each client, I should be able to say that the system processed so many messages per second. Sorry if my initial question did not convey this. Thanks for your help. - Neoster, Jul 06 '16 at 00:40
