What's a reliable way to check whether or not my script is up AND not frozen?

Question

I have a Python (or Ruby, it doesn't really matter) script on a server that has to be reliable and run all the time. If something happens and it crashes or freezes, I need to know about it immediately. Previously I was thinking about another "script", such as a cron job, that would check on it every minute using Linux tools: whether or not it's in the list of active processes. However, now I think that even if it's in the list of active processes, it still might be frozen (it hasn't crashed yet, but it's about to).

Isn't that right? If so, I'm thinking of having it save some "heartbeat" data into a file every minute, because that's a more reliable way to know whether it's up AND whether it's frozen: if it's frozen it can't write to the file, but it can still be in memory.

What do you suggest, should I go with that? Or is just checking that its process is in memory (in the list of active processes) perfectly enough?

@UriAgassi, what do you mean? I think it's the same approach as mine, but in my case I'll save more data into the file. — , Feb 14 '16 at 11:16
Yes, it is a similar approach. Since there is no functional need for any data to save, `touch` does _exactly_ what you need, and nothing more. If anything, it will save you maintenance, such as clean-up of files which are getting bigger all the time... — Uri Agassi, Feb 14 '16 at 11:29
@UriAgassi, I mean, do I really have to save a file instead of just checking whether or not the script is in the list of processes? — , Feb 14 '16 at 12:02
If it may freeze, as you say - it may be listed as a running process, but not do the work you expect it to... — Uri Agassi, Feb 14 '16 at 12:29
Create a main program thread. Check these program errors. (stop, start, kill). If a dependent application must process each will follow your event specialist. To continue with this system requires additional resources and permissions. clear ? @OskarK. — dsgdfg, Feb 14 '16 at 13:28
@dsgdfg, and how would you check for "stop, start, kill" program errors? How do you define them? — , Feb 14 '16 at 13:30
If your script wrote the Unix time (seconds since the Epoch) in a file every minute, anyone else who is interested could look in the file and see if the time in there is more than 2 minutes adrift from the current time... I mean to over-write the file, not append to it so it grows. — Mark Setchell, Feb 14 '16 at 16:35
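The timestamp-file idea from the comments could look roughly like the sketch below. The file name `heartbeat.txt`, the helper names `write_heartbeat` and `is_alive`, and the 120-second threshold are illustrative assumptions, not anything prescribed in the thread. The script would call `write_heartbeat()` from its main work loop (not from a separate thread, which could keep ticking while the real work is stuck), and a cron job or other watcher would call `is_alive()`.

```python
import os
import time

HEARTBEAT_FILE = "heartbeat.txt"  # hypothetical path, pick your own
MAX_AGE_SECONDS = 120             # "more than 2 minutes adrift"


def write_heartbeat(path=HEARTBEAT_FILE, now=None):
    """Overwrite (not append to) the file with the current Unix time."""
    stamp = int(time.time() if now is None else now)
    # Write to a temporary file first and rename it, so a reader
    # never sees a half-written heartbeat.
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(str(stamp))
    os.replace(tmp_path, path)


def is_alive(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the heartbeat file exists and is recent enough."""
    now = time.time() if now is None else now
    try:
        with open(path) as f:
            stamp = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or garbled file counts as "not alive".
        return False
    return now - stamp <= max_age
```

Because the file is overwritten each time, it never grows. A missing file is treated the same as a stale one, which also covers the case where the script never started.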

score 2 · Answer 1 · answered Feb 14 '16 at 16:56

If there are bad consequences when the script is not running (if there weren't, you probably wouldn't care, would you?), it might be most reliable to check for distinct symptoms of those consequences.

For example, if the script is a web server, have a monitoring service make requests to it and notify you whenever that fails.
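As a minimal sketch of such a check, assuming the script exposes some HTTP URL that only answers when it is healthy: the function name `check_http` and the 10-second timeout are assumptions made for illustration. It uses only the Python standard library and treats any error, timeout, or non-2xx status as a failure.

```python
import http.client
import urllib.error
import urllib.request


def check_http(url, timeout=10):
    """Return True if `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP 4xx/5xx responses,
        # broken responses, and malformed URLs.
        return False
```

A monitoring job could call `check_http(...)` once a minute and send a notification (e-mail, SMS, chat message) whenever it returns `False`. How the notification is sent is left open here.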

If the bad consequences can be observed remotely or even off-site, have the monitoring remote, or if possible, off-site from the machine running your script. Why? If the consequences occur because your script stopped running because the machine running it died ... you wouldn't get notified if it was that same machine's task to notify you. If it's a different machine, you'll be made aware of the situation. Unless the data center burnt down. Then your monitoring service needs to be in a different data center for you to get notified.

There are paid and free monitoring services for publicly accessible servers, e.g. Uptime Robot for web servers, in case you don't want to develop and host the monitoring yourself.
