
We're using Cacti with RRDTool to monitor and graph about 100,000 counters spread across about 1,000 Linux-based nodes. However, our current setup generally only gives us 5-minute graphs (with some data being minute-based); we often make changes where seeing feedback in "near real time" would be of value. I'd like approximately a week of 5- or 10-second data, a year of 1-minute data, and 5 years of 10-minute data. I have SSD disks and a dual-hexa-core server to spare.

I tried setting up a Graphite/carbon/whisper server and had about 15 nodes pipe into it, but it only offers "average" as the aggregation function when rolling data up into older buckets. This is almost useless -- I'd like min, max, average, and standard deviation, and perhaps also "total sum", "number of samples", or "95th percentile" available. The developer claims there's a new back-end "in beta" that lets you write your own function, but it appears to still only do 1:1 retention (when rolling up into older buckets, you really want several statistics streams calculated from a single input). Also, "in beta" seems a little risky for this installation. If I'm wrong about this assumption, I'd be happy to be shown my error!
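For reference, here is a sketch of how my desired retention tiers and a per-metric roll-up function might be expressed in carbon's storage-schemas.conf and storage-aggregation.conf (assuming a carbon version that accepts the human-readable retention syntax; the section names and patterns below are made up for illustration). Note that this still only gives you one aggregation method per metric, which is exactly the limitation I'm describing:

```
# storage-schemas.conf -- section name and pattern are illustrative only
[counters]
pattern = .*
# ~1 week of 10-second data, 1 year of 1-minute data, 5 years of 10-minute data
retentions = 10s:7d,1m:1y,10m:5y

# storage-aggregation.conf -- picks ONE roll-up function per matching metric
[counter_max]
pattern = \.max$
xFilesFactor = 0.1
aggregationMethod = max
```

To get min, max, average, and standard deviation out of this, you would have to publish each counter as several separate series up front, rather than having the back-end derive them from a single input stream.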

I've heard Zabbix recommended, but it puts data into MySQL or some other SQL database. 100,000 counters at a 5-second interval means 20,000 tps, and while I have an SSD, I don't have an 8-way RAID-6 with a battery-backed write cache, which I think I'd need for that to work out :-) Again, if that's actually not a problem, I'd be happy to be shown the error of my ways. Also, can Zabbix do the single data stream -> promote with statistics thing?

Finally, Munin claims to have a new 2.0 coming out "in beta" right now, and it boasts custom retention plans. However, again, it's that "in beta" part -- has anyone used that for real, and at scale? How did it perform, if so?

I'm almost thinking about using a graphing front-end (such as Graphite) and rolling my own retention back-end as a simple layer on top of mmap() and some statistics. That wouldn't be particularly hard, and would probably perform very well, letting the kernel figure out the balance between flushing to disk and serving in-memory updates.
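To make that idea concrete, here is a minimal sketch of what such a back-end could look like, assuming one fixed-size, memory-mapped file per counter holding a ring of per-bucket statistics. The file layout, bucket struct, and CounterFile class are hypothetical, invented for illustration; the kernel's page cache decides when dirty pages actually hit the SSD:

```python
import mmap, os, struct, time

# Hypothetical per-bucket record: sample count, sum, sum of squares, min, max --
# enough to derive average and standard deviation at read time.
BUCKET = struct.Struct("<qdddd")      # 40 bytes per bucket
INTERVAL = 10                         # seconds per bucket
BUCKETS = 7 * 24 * 3600 // INTERVAL   # roughly one week of 10-second buckets

class CounterFile:
    """One fixed-size, mmap()ed ring of buckets for a single counter."""
    def __init__(self, path):
        size = BUCKET.size * BUCKETS
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(fd, size)         # pre-size the file so the mapping is valid
        self.mm = mmap.mmap(fd, size)  # kernel decides when dirty pages hit disk
        os.close(fd)

    def record(self, value, now=None):
        now = time.time() if now is None else now
        slot = int(now // INTERVAL) % BUCKETS
        off = slot * BUCKET.size
        n, total, sq, lo, hi = BUCKET.unpack_from(self.mm, off)
        if n == 0:
            lo = hi = value
        BUCKET.pack_into(self.mm, off, n + 1, total + value,
                         sq + value * value, min(lo, value), max(hi, value))
        # A real implementation would also stamp each bucket with its time period
        # and reset slots that have wrapped around; omitted here for brevity.
```

At 40 bytes per bucket and about 60,000 buckets, each counter file is roughly 2.4 MB, so 100,000 counters would need on the order of 240 GB for the fine-grained tier alone; the 1-minute and 10-minute tiers are much smaller.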

Any other suggestions I should look into? Note: it has to have shown itself able to sustain the kinds of data loads I'm suggesting above; if you can point at the specific implementation you're referencing, so much the better!

Jon Watte
  • what on earth are you needing that takes 100 counters per host?! At 5 second resolution no less. – Sirex Jun 21 '11 at 10:16
  • there are some other retention functions now http://graphite.readthedocs.org/en/latest/config-carbon.html#storage-aggregation-conf – hellvinz Apr 13 '12 at 09:21
  • The problem is that you only get one. I want average, standard deviation (or variance), min and max for each counter for each bucket! – Jon Watte Apr 15 '12 at 21:09
  • Shopping Questions are Off-Topic on any of the [se] sites. See [Q&A is hard, lets go Shopping](http://blog.stackoverflow.com/2010/11/qa-is-hard-lets-go-shopping) and the [FAQ] for more details. – Chris S Aug 27 '12 at 14:19
  • Btw: Because of our admittedly not-yet-mainstream continuous deployment monitoring needs, we ended up re-building this wheel, and the solution is open source: https://github.com/imvu-open/istatd/wiki – Jon Watte Sep 27 '12 at 16:48

4 Answers


Have you looked at Ganglia?

I strongly doubt Munin would scale to your size. But Ganglia is designed from the ground up for large clusters of servers.

EightBitTony

Zabbix is known to perform well in environments with 1000+ hosts; your 5-second refresh is a little unheard of, though (maybe you need that periodicity for the majority of them, and something like 30 seconds is OK for some of them).

Zabbix proxies (think of them as mini Zabbix Servers) are advocated in huge installations to reduce the load on the Zabbix Server. http://www.packtpub.com/article/proxies-monitor-remote-locations-zabbix-1.8

From Alexei himself:

"It will collect performance and availability data, also perform auto-discovery on ZABBIX Server behalf:

  1. It is immune to communication problems. Data is locally stored.
  2. It requires one-way (Proxy to Server) TCP connections only.
  3. Almost zero maintenance. For example, if a local Proxy database does not exist, the proxy will create one automatically. So, basically a binary and small configuration file is required to setup a proxy.
  4. Configuration is stored and fully managed on Server side from the normal WEB GUI. "
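To give a feel for how small that configuration footprint is, here is a minimal zabbix_proxy.conf sketch. The hostname, server address, and file paths are hypothetical, and this assumes a proxy built with SQLite support so the local database file is created automatically, as described in the quote above:

```
# zabbix_proxy.conf -- minimal sketch, values are illustrative only
Server=zabbix.example.com        # the Zabbix Server this proxy reports to
Hostname=proxy-dc1               # must match the proxy name configured in the web GUI
DBName=/var/lib/zabbix/proxy.db  # SQLite file; created automatically if missing
LogFile=/var/log/zabbix/zabbix_proxy.log
```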
Joao Figueiredo

Check out Graphite, http://graphite.wikidot.com/. They have this to say about high capacity:

Graphite was internally developed by Orbitz.com where it is used to visualize a variety of operations-critical data including application metrics, database metrics, sales, etc. At the time of this writing, the production system at Orbitz can handle approximately 160,000 distinct metrics per minute running on two niagra-2 Sun servers on a very fast SAN.

Kendall

I'm with the other people who have commented asking why you need to monitor so many items at such a short interval. The biggest problem with doing that is that the monitoring itself begins to cause false positives with regard to load, and it reduces the CPU time available for other processing. Moving your monitoring interval from 5 to 15 seconds cuts the monitoring overhead by roughly two thirds, and still gives you at least double the visibility of the usual minimum, which is often around 30 seconds.

When you look closer you may also find that some items do not need to be monitored every 15 or 30 seconds. Disk is one example: you may be able to handle checking it once every 30 or 60 seconds. If you can only write 1.7 MB/s, you're only going to be able to push about 100 MB in a minute; if your monitoring system is set to alarm at 1 GB of free space, for instance, you still have about 10 minutes before you are out of disk (using this slow-disk example). And CPU: why do you need to know what it's doing at a resolution of less than 30 seconds? If it's loaded at 100% in a cloud, great, it's doing work like a cluster node should. But if it's at 100% load when its work queue is 0, then you have a problem.
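To put rough numbers on that reasoning, here is a small back-of-the-envelope calculation; the figures are the illustrative ones from above, not measurements:

```python
# Back-of-the-envelope numbers for the argument above (illustrative values only).

old_interval, new_interval = 5, 15              # seconds between samples
overhead_drop = 1 - old_interval / new_interval
print(f"overhead drop: {overhead_drop:.0%}")    # ~67%

write_rate_mb_s = 1.7                           # slow-disk example from above
alarm_headroom_mb = 1024                        # alarm fires with ~1 GB left
minutes_left = alarm_headroom_mb / (write_rate_mb_s * 60)
print(f"minutes until full after alarm: {minutes_left:.0f}")  # ~10 minutes
```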

Monitoring at such a tight frequency also increases your chances of false positives, because the monitoring itself induces artifacts into your data set. For instance, if your monitoring system causes a base load of 20% and 100 KB/s of traffic by checking everything at a 5-second interval, are you really getting an accurate picture of what your host is doing? As for false positives: consider triggering on a network load of 500 KB/s; your monitoring system alone puts you 20% of the way there.

Also, you have not suggested anything above that makes me think Zabbix cannot handle what you want to do. Give it a shot; we'll be waiting in the Zabbix community to help you when needed.

Red Tux