
I want to use monit to kill a process that uses more than X% CPU for more than N seconds.

I'm using stress to generate load to try a simple example.

My .monitrc:

check process stress
    matching "stress.*"
    if cpu usage > 95% for 2 cycles then stop

I start monit (I checked syntax with monit -t .monitrc):

monit -c .monitrc -d 5

And I launch stress:

stress --cpu 1 --timeout 60

Stress shows in top as using 100 %CPU.

I'd expect monit to kill stress in about 10 seconds, but stress completes successfully. What am I doing wrong?

I also tried monit procmatch "stress.*", which shows two matches for some reason. Maybe that's relevant?

List of processes matching pattern "stress.*":
stress --cpu 1 --timeout 60
stress --cpu 1 --timeout 60
Total matches: 2
WARNING: multiple processes matched the pattern. The check is FIRST-MATCH based, please refine the pattern

EDIT: Tried e.lopez's method

I had to remove the start statement from .monitrc because it caused an error in monit ('stress' failed to start (exit status -1) -- Program /usr/bin/stress timed out) and then left a zombie process.

So launched stress manually:

stress -c 1
stress: info: [8504] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

The .monitrc:

set daemon 5
check process stress
    matching "stress.*"
    stop program = "/usr/bin/pkill stress"
    if cpu > 5% for 2 cycles then stop

Launched monit:

monit -Iv -c .monitrc
Starting Monit 5.11 daemon
'xps13' Monit started
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check skipped (initializing)
'stress' 
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is running with pid 8504
'stress' zombie check succeeded [status_flag=0000]
'stress' cpu usage check succeeded [current cpu usage=0.0%]
'stress' process is not running
'stress' trying to restart
'stress' start skipped -- method not defined

Monit sees the right process (pids match), but sees 0% usage (stress is using 1 cpu at 100% per top). I killed stress manually, which is when monit says the process is not running (at the end, above). So, monit is monitoring the process fine, but isn't seeing the right cpu usage.

Any ideas?

capitalistcuttle

2 Answers


Note that if your system has many cores, stressing just one of them (--cpu 1) does not load the whole system. In my tests with an i7 processor, fully loading one core only brought total system usage to about 12.5%.

Depending on the number of cores, you might want to run stress accordingly:

stress -c X

where X is the number of cores you want to stress.
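
To make the arithmetic concrete, here is a minimal sketch of how one busy core translates into a share of the whole machine. It assumes monit reports CPU usage relative to all cores, as the 12.5% figure suggests; total_cpu_share is a hypothetical helper name, not part of monit or stress.

```shell
# Hypothetical helper: convert a per-core CPU percentage into a share
# of the whole machine, given the number of cores.
total_cpu_share() {
    per_core_pct=$1
    cores=$2
    awk -v p="$per_core_pct" -v c="$cores" 'BEGIN { printf "%.1f\n", p / c }'
}

# One fully loaded core on an 8-thread machine (assumed core count).
total_cpu_share 100 8   # prints 12.5
```

On Linux you can pass the output of nproc as the second argument to use the real core count of your machine.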

But this is not your main issue. Your problem is that you do not give monit a stop instruction for the stress program. Look at this:

check process stress
    matching "stress.*"
    start program = "/usr/bin/stress -c 1" with timeout 10 seconds
    stop program = "/usr/bin/pkill stress"
    if cpu > 5% for 2 cycles then stop

You are missing at least the "stop" line, which defines the command monit uses to actually stop the process. Since stress is not a service, you can use pkill to kill the process.

I tested the above configuration successfully. Output of the monit.log:

[CET Nov  5 09:03:02] info     : 'stress' start action done
[CET Nov  5 09:03:02] info     : 'Overlord' start action done
[CET Nov  5 09:03:12] info     : Awakened by User defined signal 1
[CET Nov  5 09:03:22] error    : 'stress' cpu usage of 12.5% matches resource limit [cpu usage<5.0%]
[CET Nov  5 09:03:32] error    : 'stress' cpu usage of 12.4% matches resource limit [cpu usage<5.0%]
[CET Nov  5 09:03:32] info     : 'stress' stop: /usr/bin/pkill

So, assuming you are just testing and the actual CPU usage is not important, use the config I provided above. Once you are sure your config works, adjust the resource limits for the processes you want to monitor in a production environment.

Always have at hand: https://mmonit.com/monit/documentation/

Hope it helps.

Regards

Eduardo López
  • Many thanks for the detailed response. Unfortunately, still running into problems. Please see edits. If this is more trouble than it's worth, I'd totally understand. Beginning to think just writing some simple utility myself would be easier than debugging monit. – capitalistcuttle Nov 07 '15 at 23:42
  • It somehow does not make sense at all that the provided configuration does not work in your environment... I assume you refreshed monit's configuration by doing "sudo monit -t && sudo service monit restart && sudo monit start all" after updating the configuration file? In that case, I'd suggest reinstalling monit. – Eduardo López Nov 15 '15 at 18:27
  • `-monit -c X` `+stress -c X` – Michael Shigorin Jul 30 '20 at 17:53

I think you're seeing 0% CPU because stress -c 1 creates two processes: a "worker" process that generates the load and a second, mostly idle background process (open htop and filter for stress to see it).

If a regex matches more than one process, monit will pick the process with the longest uptime (check the monit doc) - for me the background process always had a longer uptime than the "worker" process.
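
As a rough illustration of that selection rule, the sketch below picks the PID with the longest uptime from a list of "PID elapsed-seconds" pairs. This only mimics the behaviour described above and is not monit's actual code; oldest_pid is a hypothetical helper, and 8505 is a made-up PID for the worker (8504 is the parent PID from the question).

```shell
# Hypothetical helper: read "PID ELAPSED_SECONDS" pairs on stdin and
# print the PID of the process with the longest uptime.
oldest_pid() {
    awk 'NR == 1 || $2 + 0 > max + 0 { max = $2; pid = $1 }
         END { if (NR > 0) print pid }'
}

# Sample data: the parent started slightly before the worker it forked.
# With Linux procps, live input would come from something like:
#   ps -C stress -o pid=,etimes= | oldest_pid
printf '%s\n' '8504 120' '8505 119' | oldest_pid   # prints 8504
```

With this sample data the idle parent is chosen, which would explain a reported CPU usage of 0.0% while the worker runs at 100%.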

You can mitigate this by using stress-ng. Here the "worker" process has a distinct name so there is no ambiguity when matching.

stress-ng -c 1

works with the following .monitrc file

set daemon 5
check process stress
    matching "stress-ng-cpu"
    stop program = "/usr/bin/pkill stress-ng"
    if cpu > 5% for 2 cycles then stop