5

When operating in the cloud and scaling boxes automatically, certain monitoring issues come up. Sometimes we might be monitoring 10 boxes and sometimes 100. The machines scale up and down based on demand.

Right now, I think the best approach is to choose a monitoring solution that allows targets to be instantiated via calls to an API. But is this really the best? I like the idea of dynamic discovery, but that is also a problem in the cloud, since the targets are not all in the same subnet.

What monitoring solutions allow for a scaling environment like this? Zabbix currently has a draft API, but I have been unable to find a similar API for Nagios. Is there a similar API for Nagios?

Does anyone have any alternative suggestions besides Nagios and Zabbix?

wickett

7 Answers

3

Farmville, which claims to be adding hundreds of servers a week, uses Puppet, Nagios, and Munin to handle their scalable monitoring system. They probably use Puppet facts to populate the Nagios config files or to set up NRPE. With that many servers, a config management tool like Puppet is practically a requirement.

A couple of examples found by searching for "puppet nagios":

http://blog.gurski.org/index.php/2010/01/28/automatic-monitoring-with-puppet-and-nagios/

http://projects.puppetlabs.com/projects/puppet/wiki/Nagios_Patterns

https://github.com/DavidS/puppet-nagios

Rob Olmos
3

Use Zabbix. Their upcoming 2.0 release has a lot of new features for things like this. The current version, 1.8, has auto-registration.

The New Features doc talks about this feature:

4.2.2 Auto registration for active agents

Completely new in Zabbix 1.8, it is possible to allow active Zabbix agent auto-registration, after which server can start monitoring them. This allows to add new hosts for monitoring without any manual server configuration for each individual host.

The feature might be very handy for automatic monitoring of new Cloud nodes. As soon as you have a new node in the Cloud Zabbix will automatically start collection of performance and availability data of the host.
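As a rough illustration of what a new node has to carry for auto-registration, here is a minimal sketch that renders the relevant agent settings at boot time. `ServerActive`, `Hostname`, and `HostMetadata` are real `zabbix_agentd.conf` parameters, but `HostMetadata` only exists in releases newer than 1.8, so it is optional here. The function name and the example values are made up for illustration, and the server still needs an auto-registration action configured before it will act on the new agent.

```python
def render_agent_conf(server: str, hostname: str, metadata: str = "") -> str:
    """Return the zabbix_agentd.conf lines an active agent needs to announce itself.

    server   -- address of the Zabbix server that receives active checks
    hostname -- name the node registers under; must be unique per node
    metadata -- optional free-form tag (newer Zabbix releases only)
    """
    lines = [
        f"ServerActive={server}",
        f"Hostname={hostname}",
    ]
    if metadata:
        # Lets the server-side action tell node roles apart.
        lines.append(f"HostMetadata={metadata}")
    return "\n".join(lines) + "\n"
```

An instance boot script could write the returned text into the agent's config file and then start the agent; where that file lives depends on how the agent was packaged.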

Jakob
  • Zabbix's auto-registration features are really poor, but they exist and may fit someone's needs. There is apparently not much information on the Internet about anyone using this feature, so almost all the information you can get is in this [manual](https://www.zabbix.com/documentation/2.2/manual/discovery/auto_registration). – Dmitry Verkhoturov Apr 13 '15 at 10:33
  • Another thought: the Ruby [zabbixapi](https://github.com/express42/zabbixapi) gem plus Puppet [custom functions](https://docs.puppetlabs.com/guides/custom_functions.html) make it possible to create a host in Zabbix if it's not there. It's a good solution if you have already puppetized your infrastructure. – Dmitry Verkhoturov Apr 13 '15 at 10:59
1

No suggestions, but your logic is sound: in dynamic environments like the one you describe, when a host comes up it needs to register with anything that needs to know about its existence (e.g. the monitoring system), and when it gets shut down it needs to un-register with the things that need to know it's going away.

The question I would ask is: do you need to monitor your "workhorse" servers? If they're compute nodes or similar, and you know their configuration is stable and will "just work" when they get spun up, then monitoring the cloud itself (how many instances are running) may be just as good as tracking the individual machines, assuming your cloud provider lets you access such statistics easily.

voretaq7
  • I want to be able to monitor all of the servers in the environment, so that the monitoring solution lets us know when to add more servers, using more significant metrics (app/data metrics) when scaling up or down. – wickett May 17 '10 at 15:36
  • A better way to say that, is that I want to be able to make decisions on scale based on the monitoring solution... – wickett May 17 '10 at 15:37
  • 1
    you should probably know what your scalability issues are going to be prior to moving the service to the cloud – Jim B May 17 '10 at 16:32
  • 1
    We do something fairly crude with template files and the Nagios cfg_dir option to create one config per active server. The difference to the above is that we drive the updates from the scaling management processes rather than from the servers themselves. This lets us catch issues with servers not starting correctly or servers dying ensuring they're replaced. – Dominic Cleal May 17 '10 at 18:35
  • Jim B, we do know what the scalability issues are going to be. As we scale up or down, we want to make sure the monitoring solution knows that there are new targets, or that we don't need to monitor unused targets. It would also be good to make decisions about adding and removing servers based on the monitor status. – wickett May 17 '10 at 19:54
  • m0dlx, that sounds pretty much in line with what we are wanting to do. – wickett May 17 '10 at 19:56
  • m0dlx, do you have to restart nagios every time you add a target under this scenario? – wickett May 17 '10 at 21:09
  • wickett, we perform a reload each time so it keeps persistent comments and all of its state - I've seen no issues come from it. The main disadvantage of the method I've found is that when the hosts no longer exist, historical data for the hosts is no longer available via the Nagios interface. You also need a buffer after starting servers so monitoring doesn't start immediately. – Dominic Cleal May 18 '10 at 11:22
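The approach described in these comments (driving monitoring changes from the scaling manager, with a buffer after startup) can be sketched as a small planning step. This is only an illustration under assumptions: the function name, the `running` mapping of host name to launch time, and the 300-second default buffer are all hypothetical, not taken from anyone's actual setup.

```python
from typing import Dict, Set, Tuple


def plan_changes(
    running: Dict[str, float],
    monitored: Set[str],
    now: float,
    grace_seconds: float = 300.0,
) -> Tuple[Set[str], Set[str]]:
    """Decide which hosts to start and stop monitoring.

    running   -- host name -> launch time (epoch seconds), as reported by
                 the scaling manager rather than by the hosts themselves
    monitored -- host names that currently have a monitoring config
    now       -- current time (epoch seconds)
    """
    # Hosts younger than the grace period are skipped so that checks
    # do not fire while the machine is still booting.
    ready = {host for host, started in running.items()
             if now - started >= grace_seconds}
    to_add = ready - monitored
    # Anything monitored that the scaling manager no longer lists is stale.
    to_remove = monitored - set(running)
    return to_add, to_remove
```

Because the list of running hosts comes from the scaling manager, a server that never started correctly still gets a config once its grace period ends, so its failing checks are visible rather than silently absent.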
1

If you set up nagios to load directories of configuration files using "cfg_dir" you can simply add or remove a cfg-file when a node is added or removed, and restart nagios. No real need for an API, it can be set up with a few small shell scripts and SSH with key files.
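A minimal sketch of the add/remove step, written in Python rather than shell. It assumes one file per host inside the `cfg_dir` directory and a host template named `generic-host` (the name used in the Nagios sample configs; yours may differ). The helper names `add_host` and `remove_host` are made up for this example.

```python
from pathlib import Path

# Nagios object definition for a single host; doubled braces are literal.
HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {name}
    address    {address}
}}
"""


def add_host(cfg_dir, name, address, template="generic-host") -> Path:
    """Write <name>.cfg into cfg_dir and return its path."""
    path = Path(cfg_dir) / f"{name}.cfg"
    path.write_text(
        HOST_TEMPLATE.format(template=template, name=name, address=address)
    )
    return path


def remove_host(cfg_dir, name) -> bool:
    """Delete <name>.cfg from cfg_dir; return False if it was not there."""
    path = Path(cfg_dir) / f"{name}.cfg"
    if path.exists():
        path.unlink()
        return True
    return False
```

Neither function touches the running daemon. After a change, the config can be checked with `nagios -v` against the main config file before Nagios is reloaded in whatever way your installation does that.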

I have no experience with Zabbix but I can recommend Nagios since it is pretty easy to configure, run and customize.

Pontus
1

For the Zabbix API, there's a command-line tool, zabcon (http://trac.red-tux.net/wiki/zbx_api/interactive). It's not fully functional yet, but it should support some basic host and item operations. Maybe you can work from that.
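For scripting against the API directly instead of going through zabcon, the request for adding a host might look roughly like the sketch below. It only builds the JSON-RPC body and does not send it. The field names follow the `host.create` method as documented for Zabbix 2.x and later, so the draft API in 1.8 may differ, and newer releases expect the token in an HTTP `Authorization` header rather than in the `auth` field. The function name and all argument values are hypothetical.

```python
import json


def host_create_request(host, ip, group_id, auth_token, request_id=1) -> str:
    """Build a JSON-RPC body asking Zabbix to create one agent-monitored host."""
    payload = {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": host,
            "interfaces": [
                {
                    "type": 1,        # 1 = Zabbix agent interface
                    "main": 1,        # default interface of this type
                    "useip": 1,       # connect by IP rather than DNS name
                    "ip": ip,
                    "dns": "",
                    "port": "10050",  # default agent port
                }
            ],
            "groups": [{"groupid": group_id}],
        },
        "auth": auth_token,  # assumption: token obtained from a prior login call
        "id": request_id,
    }
    return json.dumps(payload)
```

The resulting string would be POSTed to the frontend's `api_jsonrpc.php` endpoint with a JSON content type. A matching `host.delete` call on scale-down would keep the host list in step with the running instances.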

Richlv
1

While I have no experience with Zabbix, I'm pretty sure Nagios will not be able to do this for you without admin intervention, let alone out of the box. The problem is that when you create a config file (to add a host) or edit/delete one, you need to restart Nagios. Upon restarting, it will take a couple of minutes (depending on settings) to do the first check of the services on that host (checking whether the host itself is up should only take a couple of seconds). If these machines get added or removed several times a day, I foresee this being your first problem.

You could use a system to do the discovery for you (Nagios has plugins that do this, I believe), but I've found that machine-generated cfg files are never as good as ones made manually. In fact, most of these automated configs end up all in one file, or perhaps a handful of files, which makes them a PITA to manage...

However, with Nagios being open source and all, I am confident that if you have the required knowledge you could code and implement a system of your own. I suspect that the machines that come up (or go down) are VM's, and that they already have NSClient or whatever agent you decide to use pre-installed. That means that if you can get a script to run whenever a machine comes up or goes down, you could create or delete a config file with the name .cfg or .cfg and then reload Nagios. Get the script to edit the hostname and ip of the host in question, and you're done! That is, of course, if the first point I made is of no importance to you...

Good luck

HannesFostie
0

It's been a while since I played with Zenoss, but I think it might be what you're looking for.

Marco Ramos
  • Thanks Marco. I will check it out. http://www.zenoss.com/product/cloud-monitoring Looks promising – wickett May 17 '10 at 15:38