
Which good solutions (software or hardware) have been developed and applied in enterprise environments for online failure prediction? Zabbix, Openstb, Cacti, and similar alternatives? Can you list some more? Can you describe their advantages and disadvantages, specifically with respect to failure prediction?

I want to understand their disadvantages and improve on them with better models/algorithms. If you are not familiar with the concept of online failure prediction, please refer to the description below. If you already know it, just skip it.

Online failure prediction: an approach to evaluate whether a failure will occur in the near future, when it will occur, and in which component (software or hardware) it will occur. It is a short-term prediction based on failure tracking, reporting of detected errors, symptoms of undetected errors, and fault auditing (actively searching for faults, for example checking for inode inconsistencies in Linux filesystems).

A much more detailed introduction, along with relevant approaches, is given in this paper: https://s3-us-west-2.amazonaws.com/mlsurveys/88.pdf

Thank you very much!

zhangjie
  • Cacti can't help there at all, for sure. I can think of something that runs on top of Graphite and checks various metrics for anomalies, which sometimes reveal current problems and sometimes near- or far-future ones, **but** all such software is very complex to set up and tune, so I've seen it applied only to business metrics in production. When you want to monitor free disk space, it's better to do it directly rather than rely on some black box with many knobs. – Dmitry Verkhoturov Jan 23 '16 at 08:29
  • Maybe a failure prediction system should be user-friendly rather than academic. I think failure prediction still has some potential. Thanks very much. :) – zhangjie Jan 23 '16 at 16:24

1 Answer


Comparison of monitoring systems: https://en.wikipedia.org/wiki/Comparison_of_network_monitoring_systems

I don't think any monitoring system has failure prediction out of the box. The paper you linked is too academic. You can still build it on top of a monitoring system, which will provide the data/events/failures for your failure prediction algorithms.

Some monitoring systems have:

  • metric prediction (trend prediction). It's not failure prediction. Zabbix has a nice semi-academic write-up about it - Zabbix prediction.

  • anomaly detection - again, it's not prediction, it's detection. The best-known OSS for anomaly detection is Skyline. RRD-based systems (Cacti) use the RRD Holt-Winters algorithm. Graphite also has some math functions that can be used for anomaly detection.
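
To make the difference between the two concrete, here is a minimal Python sketch. It is not how Zabbix, Skyline, RRD, or Graphite implement these features; the function names (`fit_line`, `linear_forecast`, `time_until`, `is_anomaly`), the least-squares line, and the 3-sigma rule are all assumptions chosen for illustration. Trend prediction extrapolates a metric into the future, while anomaly detection only says whether the latest value looks unusual right now.

```python
from statistics import mean, pstdev


def fit_line(samples):
    """Least-squares line through (timestamp, value) pairs -> (slope, intercept)."""
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_mean, v_mean = mean(ts), mean(vs)
    denom = sum((t - t_mean) ** 2 for t in ts)
    if denom == 0:
        raise ValueError("need samples with at least two distinct timestamps")
    slope = sum((t - t_mean) * (v - v_mean) for t, v in samples) / denom
    return slope, v_mean - slope * t_mean


def linear_forecast(samples, horizon):
    """Trend prediction: value expected `horizon` seconds after the last sample."""
    slope, intercept = fit_line(samples)
    last_t = samples[-1][0]
    return slope * (last_t + horizon) + intercept


def time_until(samples, threshold):
    """Seconds from the last sample until the fitted line reaches `threshold`.

    Assumes a rising metric (e.g. disk usage); returns None if the trend
    is flat or falling and the threshold is therefore never reached.
    """
    slope, intercept = fit_line(samples)
    if slope <= 0:
        return None
    last_t = samples[-1][0]
    return max(0.0, (threshold - intercept) / slope - last_t)


def is_anomaly(history, value, n_sigma=3.0):
    """Anomaly detection: is `value` more than n_sigma deviations from the mean?"""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > n_sigma * sigma


if __name__ == "__main__":
    # Hypothetical disk usage in percent, sampled once a minute.
    disk = [(0, 50.0), (60, 52.0), (120, 54.0), (180, 56.0)]
    print("usage in 10 min:", linear_forecast(disk, 600))
    print("seconds until 90%:", time_until(disk, 90.0))
    print("is 25 anomalous?", is_anomaly([10, 11, 9, 10, 10], 25))
```

Neither function predicts a failure by itself: a full disk may or may not break anything, and an anomaly may already be the failure.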

If you want to implement/improve failure detection, then make it generic:

  • input layer - a plugin concept, so users can use or write their own plugins, each pulling data from a specific monitoring system
  • failure detection layer - there are many algorithms, so each of them should be configurable
  • output layer - similar to the input layer, so an event about a predicted failure can go back to the monitoring system or to another alerting system
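
The three layers above could be sketched in Python roughly like this. Every class and function name here (`InputPlugin`, `Predictor`, `OutputPlugin`, `StaticInput`, `ThresholdPredictor`, `ListOutput`, `run_once`) is hypothetical and only illustrates the layering. The threshold check is a trivial stand-in for a real prediction algorithm, and the in-memory input/output plugins stand in for real monitoring-system integrations.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    metric: str
    timestamp: float
    value: float


@dataclass
class Prediction:
    metric: str
    message: str


class InputPlugin(ABC):
    """Input layer: pulls samples from one specific monitoring system."""
    @abstractmethod
    def fetch(self) -> List[Sample]: ...


class Predictor(ABC):
    """Algorithm layer: each algorithm is a separately configurable object."""
    @abstractmethod
    def predict(self, samples: List[Sample]) -> List[Prediction]: ...


class OutputPlugin(ABC):
    """Output layer: sends events back to monitoring or to an alerting system."""
    @abstractmethod
    def emit(self, prediction: Prediction) -> None: ...


class StaticInput(InputPlugin):
    """Stand-in input plugin that returns samples held in memory."""
    def __init__(self, samples: List[Sample]):
        self._samples = list(samples)

    def fetch(self) -> List[Sample]:
        return list(self._samples)


class ThresholdPredictor(Predictor):
    """Placeholder algorithm: flags metrics whose latest value exceeds `limit`."""
    def __init__(self, limit: float):
        self.limit = limit

    def predict(self, samples: List[Sample]) -> List[Prediction]:
        latest = {}
        for s in sorted(samples, key=lambda s: s.timestamp):
            latest[s.metric] = s  # later timestamps overwrite earlier ones
        return [
            Prediction(m, f"{m} at {s.value}, above limit {self.limit}")
            for m, s in latest.items() if s.value > self.limit
        ]


class ListOutput(OutputPlugin):
    """Stand-in output plugin that just collects events in a list."""
    def __init__(self):
        self.events: List[Prediction] = []

    def emit(self, prediction: Prediction) -> None:
        self.events.append(prediction)


def run_once(inputs, predictors, outputs) -> List[Prediction]:
    """One pass: gather samples, run every algorithm, fan out the events."""
    samples = [s for plugin in inputs for s in plugin.fetch()]
    predictions = [p for algo in predictors for p in algo.predict(samples)]
    for p in predictions:
        for out in outputs:
            out.emit(p)
    return predictions
```

With this shape, supporting another monitoring system or another algorithm means adding one subclass, without touching the other two layers.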

Please make it user-friendly (not academic) and use GitHub. Ping me when you need to test it. :-)

Jan Garaj
  • Your answer helps me a lot. I will try to learn the details of Zabbix and do some implementation work. Thanks for your enthusiasm :) – zhangjie Jan 23 '16 at 16:20