
I have over a hundred servers sending metrics to my statsd-graphite setup. A leaf of the metric subtree looks something like:

stats.dev.medusa.ip-10-0-30-61.jaguar.v4.outbox.get

stats.dev.medusa.ip-10-0-30-62.jaguar.v4.outbox.get

Most of my crawlers are AWS spot instances, which means that around 20 of them go down and come back up at random, getting a different IP address each time. So the same list becomes:

stats.dev.medusa.ip-10-0-30-6.<subtree>

stats.dev.medusa.ip-10-0-30-1.<subtree>

stats.dev.medusa.ip-10-0-30-26.<subtree>

stats.dev.medusa.ip-10-0-30-21.<subtree>

Assuming the metrics under one instance's subtree total about 4 GB, 20 spot instances going down and 30 new ones later spawning with different IP addresses means my storage suddenly puffs up by 120 GB. Moreover, this is a weekly occurrence.

While it is simple and straightforward to delete the older IP subtrees, I really want to retain the metrics. I can have 3 medusas at week 0, 23 at week 1, 15 in week 2, and 40 in week 4. What are my options? How would you tackle this?


1 Answer


We achieve this by not logging the IP address. Use a deterministic allocation scheme with locking: when instances come up they request a machine ID, and they then use that machine ID instead of the IP address in the statsd bucket.

stats.dev.medusa.machine-1.<subtree>
stats.dev.medusa.machine-2.<subtree>
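
For illustration, a crawler could fetch its machine ID at boot and use it as its statsd prefix instead of its IP. A minimal sketch, assuming a Python statsd client and a hypothetical allocator endpoint (neither the endpoint URL nor the client library is specified in this answer):

    import requests
    import statsd

    # Hypothetical allocator endpoint -- the answer only says there is "a simple
    # number allocator api on a separate machine", not how it is called.
    ALLOCATOR_URL = "http://allocator.internal/machine-id"

    # The instance can identify itself via the EC2 metadata service.
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text

    machine_id = requests.post(
        ALLOCATOR_URL, json={"instance_id": instance_id}, timeout=5
    ).json()["machine_id"]

    # With the default statsd/graphite config this ends up as
    # stats.dev.medusa.machine-<n>.jaguar.v4.outbox.get, whatever IP we got.
    client = statsd.StatsClient("statsd.internal", 8125,
                                prefix="dev.medusa.machine-%d" % machine_id)
    client.incr("jaguar.v4.outbox.get")
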

This means you should only ever have up to 40 of these buckets. We are using this approach successfully, with a simple number-allocator API on a separate machine that hands out the instance numbers. Once a machine has an instance number, it stores it as a tag on itself, so the allocator can query the tags of the EC2 instances to see which numbers are currently in use. This allows it to re-allocate old machine IDs.
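
A rough sketch of what the allocator side could look like, assuming Python with boto3; the MachineId tag name, the 1..40 range, and the function names are all assumptions, since this answer only describes the idea:

    import boto3

    TAG_KEY = "MachineId"   # hypothetical tag name; the answer just says "a tag on that machine"
    MAX_MACHINES = 40       # the upper bound mentioned above

    ec2 = boto3.client("ec2")

    def used_machine_ids():
        """Machine ids currently claimed by live instances, read from their tags."""
        resp = ec2.describe_instances(
            Filters=[
                {"Name": "tag-key", "Values": [TAG_KEY]},
                {"Name": "instance-state-name", "Values": ["pending", "running"]},
            ]
        )
        taken = set()
        for reservation in resp["Reservations"]:
            for instance in reservation["Instances"]:
                for tag in instance.get("Tags", []):
                    if tag["Key"] == TAG_KEY:
                        taken.add(int(tag["Value"]))
        return taken

    def allocate_machine_id(instance_id):
        """Hand out the lowest free id and record it as a tag on the requesting instance.

        Terminated spot instances drop out of the describe_instances result, so
        their ids become free again and get reused, which is what keeps the
        bucket count bounded. A real allocator would serialize these requests
        (a lock or a single worker) so two instances cannot grab the same id.
        """
        taken = used_machine_ids()
        machine_id = next(n for n in range(1, MAX_MACHINES + 1) if n not in taken)
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": TAG_KEY, "Value": str(machine_id)}],
        )
        return machine_id
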
