I'm starting a job managing a VPS, and one of the problems I'm trying to solve involves Mailman. LFD is sending emails reporting problems (at least 6 every 10 minutes), all with very similar content. This is the content of one of them:

    Time:         Mon Feb  5 15:10:42 2018 -0500
    Account:      mailman
    Resource:     Process Time
    Exceeded:     433234 > 2000 (seconds)
    Executable:   /usr/bin/python
    Command Line: /usr/bin/python /usr/local/cpanel/3rdparty/mailman/bin/qrunner --runner=RetryRunner:0:1 -s
    PID:          20186 (Parent PID:20170)
    Killed:       No

I don't want LFD to stop reporting (which I know how to do); I want to fix the cause of the problem. Could anyone point me in the right direction?

Fahed

1 Answer

Your monitoring system expects that the process "qrunner" should only run for a maximum of 2000 seconds. However, that process is a part of Mailman that should be started at boot and keep running. Setting a limit on the number of seconds such a process should run is not a good idea.

In other words, you should fix the configuration of your monitoring software. Generally speaking, any automated report that does not require action should not be created in the first place; it'll desensitize you to error reports that are actually useful.
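
With CSF/LFD, one place such an exception can live is `csf.pignore`, which accepts `pcmd:` entries that match a process command line by regular expression. The sketch below only sanity-checks a candidate pattern against the command line from the report before it goes into that file. The pattern itself is an assumption, as is the idea that LFD matches it against the whole command line; LFD uses Perl regular expressions, and Python's `re` behaves the same way for a pattern this simple.

```python
import re

# Command line copied from the LFD report in the question.
REPORTED_CMD = (
    "/usr/bin/python /usr/local/cpanel/3rdparty/mailman/bin/qrunner "
    "--runner=RetryRunner:0:1 -s"
)

# Candidate pattern (an assumption, not taken from the thread): match any
# Mailman qrunner started from this path, whatever --runner value it has.
PATTERN = r"/usr/bin/python /usr/local/cpanel/3rdparty/mailman/bin/qrunner\s.*"


def matches(pattern, cmd):
    """Return True if the pattern matches the entire command line."""
    return re.fullmatch(pattern, cmd) is not None


def pignore_line(pattern):
    """Format the pattern as a regex command-line entry for csf.pignore."""
    return "pcmd:" + pattern


# The reported process should match; an unrelated Python script should not.
print(matches(PATTERN, REPORTED_CMD))
print(matches(PATTERN, "/usr/bin/python /tmp/other.py"))
print(pignore_line(PATTERN))
```

If the first check passes and the second fails, the printed `pcmd:` line is a candidate entry for `csf.pignore`. LFD generally needs a restart before it picks up changes to that file. Keeping the pattern narrow means other long-running Python processes are still reported.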

Jenny D
  • Thanks, I have found I can make `csf.pignore` ignore the process. Do you consider it a healthy way to go? – Fahed Feb 07 '18 at 17:44
  • @Fahed Yes, that's what I would recommend. As a new sysadmin taking over an existing system, figuring out which alarms are actually unimportant and thus should be silenced, and which things *should* be monitored but aren't, is usually a good way to get an overview of how the system works. It's hard when you're also new to the software running on it, of course, but it'll make your life a lot easier in the future - and figuring out which alarms are important will help you learn the software as well. – Jenny D Feb 08 '18 at 08:28
  • @Fahed And [here's a horror story](https://medium.com/backchannel/how-technology-led-a-hospital-to-give-a-patient-38-times-his-dosage-ded7b3688558) about too many error messages ending up leading to a death. Hopefully your system won't have quite so drastic an impact, but the story explains "alarm fatigue", which is a real phenomenon and can lead sysadmins to miss the messages that are actually important. – Jenny D Feb 08 '18 at 08:31
  • You're welcome! Back some 20 years ago when I was a new sysadmin, I got a lot of help from more experienced peers. I am happy to be able to pay it forward. – Jenny D Feb 09 '18 at 08:53