Is Ganglia's RRD module a bottleneck?

Question

I want to monitor a LOT of metrics on a lot of machines. On the Graphite website I noticed the FAQ entry quoted below, which explains why whisper was written in the first place and suggests that Ganglia's RRD component may pose a scalability problem. If that problem has not been addressed, I wonder whether there is a way to use gweb2 to read whisper data (I like Ganglia 2's web app).

"The second reason whisper was written is performance. RRDtool is very fast, in fact it is much faster than whisper. But the problem with RRD (at the time whisper was written) was that RRD only allows you to insert a single value into a database at a time, while whisper was written to allow the insertion of multiple data points at once, compacting them into a single write operation. The reason this improves performance drastically under high load is because Graphite operates on many many files, and with such small operations being done (write a few bytes here, a few over there, etc) the bottleneck is caused by the number of I/O operations. Consider the scenario where Graphite is receiving 100,000 distinct metric values each minute, in order to sustain that load Graphite must be able to write that many data points to disk each minute. But assume that you're underlying storage can only handle 20,000 I/O operations per minute. With RRD (at the time whisper was written), there was no chance of keeping up. But with whisper, we can keep caching the incoming data until we accumulate say 10 minutes worth of data for a given metric, then instead of doing 10 I/O operations to write those 10 data points, whisper can do it in 1 operation. The reason I have kept mentioning "at the time whisper was written" is because RRD now supports this behavior. However Graphite will continue to use whisper as long as the first issue still exists."
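The arithmetic in the quoted FAQ can be checked with a short sketch. It assumes, as the FAQ's example implicitly does, that every metric reports one value per minute and that each write costs exactly one I/O operation; the function names are made up for illustration.

```python
import math

def writes_per_minute(metrics_per_minute: int, points_per_write: int) -> float:
    """Average write operations per minute when each metric's points
    are buffered and written points_per_write at a time."""
    return metrics_per_minute / points_per_write

def min_batch_size(metrics_per_minute: int, io_budget_per_minute: int) -> int:
    """Smallest number of points per write that fits the I/O budget."""
    return math.ceil(metrics_per_minute / io_budget_per_minute)

METRICS = 100_000    # distinct metric values per minute (from the FAQ)
IO_BUDGET = 20_000   # I/O operations per minute the storage can handle (from the FAQ)

for batch in (1, 5, 10):
    ops = writes_per_minute(METRICS, batch)
    verdict = "keeps up" if ops <= IO_BUDGET else "falls behind"
    print(f"{batch:>2} point(s) per write: {ops:>9,.0f} ops/min, {verdict}")

print("minimum batch size:", min_batch_size(METRICS, IO_BUDGET))
```

Under those assumptions, one point per write needs five times the available I/O budget, while the FAQ's 10-minute batching needs only half of it.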

Define ALOT.... I know that Zenoss has evolved to use RRDCacheD to relieve this issue, but we're talking 100k datapoints. — SpacemanSpiff, Nov 09 '12 at 03:18
Also.. can you snag a FusionIO card by chance? gets past this issue nicely at the hardware level — SpacemanSpiff, Nov 09 '12 at 03:19

score 2 · Accepted Answer · answered Nov 20 '12 at 14:14


The fix that the whisper FAQ alludes to when it says "RRD now supports this behavior" is called "RRDCacheD". And yes, Ganglia can be configured to use it, per http://sourceforge.net/apps/trac/ganglia/wiki/rrdcached_integration , which should drastically improve Ganglia's I/O performance.
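To illustrate why a caching layer in front of the RRD files helps, here is a toy model of write coalescing. It is not RRDCacheD's implementation: the class and its count-based flush policy are hypothetical, chosen only to keep the sketch short (as far as I know, the real daemon flushes on a timer rather than on a point count).

```python
from collections import defaultdict

class CoalescingWriter:
    """Toy model: buffer updates per metric, count one write per flush."""

    def __init__(self, flush_after: int):
        self.flush_after = flush_after      # points to buffer before writing
        self.pending = defaultdict(list)    # metric name -> buffered points
        self.write_ops = 0                  # simulated disk writes
        self.points_written = 0

    def update(self, metric: str, timestamp: int, value: float) -> None:
        buf = self.pending[metric]
        buf.append((timestamp, value))
        if len(buf) >= self.flush_after:
            self._flush(metric)

    def _flush(self, metric: str) -> None:
        points = self.pending.pop(metric, [])
        if points:
            self.write_ops += 1             # one write covers the whole batch
            self.points_written += len(points)

    def flush_all(self) -> None:
        for metric in list(self.pending):
            self._flush(metric)

def simulate(num_metrics: int, minutes: int, flush_after: int) -> CoalescingWriter:
    """Feed one value per metric per minute, then flush what is left."""
    writer = CoalescingWriter(flush_after)
    for minute in range(minutes):
        for m in range(num_metrics):
            writer.update(f"metric{m}", minute * 60, 0.0)
    writer.flush_all()
    return writer

unbuffered = simulate(num_metrics=100, minutes=30, flush_after=1)
buffered = simulate(num_metrics=100, minutes=30, flush_after=10)
print(unbuffered.write_ops, buffered.write_ops)
```

Both runs store the same number of points; only the number of write operations differs, which is the effect the FAQ describes.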


janneb


Note: people often find that decent rrdtool update performance just needs sufficient cache on the receiving host, so that the OS does not need to read blocks from disk for normal updates but can go straight to writing ... – Tobi Oetiker Jan 11 '14 at 07:56
I have a gweb / gmetad server that collects from ~70 hosts. It had about 20% I/O utilization constantly, which dropped to 3~5% after switching to RRDCacheD. The utilization is low enough to run on an old Atom 230 nettop. ;) – Nov 11 '14 at 14:59
