
I use the heart program to restart an Erlang node when it becomes unresponsive. However, I am finding it hard to understand why the node freezes. SASL logs don't show any errors, and my own logs don't seem to show anything remarkable happening at those times. Can anybody give advice on debugging this sort of thing?

Alexey Romanov

3 Answers


By default the heart program issues a SIGKILL to kill off the unresponsive VM so it can quickly start a new one. This makes getting any useful information about the VM pretty much impossible. Something I've tried in the past is to patch the heart program to avoid the hard kill and instead get the VM to create a crash dump and a coredump. I used a patch like this (this one is for Erlang/OTP R14B02):

--- erts/etc/common/heart.c.orig 2011-04-17 12:11:24.000000000 -0400
+++ erts/etc/common/heart.c 2011-04-17 12:12:36.000000000 -0400
@@ -559,10 +559,11 @@
     int res;
     if(heart_beat_kill_pid != 0){
    pid = (pid_t) heart_beat_kill_pid;
-   res = kill(pid,SIGKILL);
+   res = kill(pid,SIGUSR1);
+   sleep(4);
    for(i=0; i < 5 && res == 0; ++i){
        sleep(1);
-       res = kill(pid,SIGKILL);
+       res = kill(pid,i < 2 ? SIGQUIT : SIGKILL);
    }
    if(errno != ESRCH){
        print_error("Unable to kill old process, "

As you can see, with this patch heart will first issue a SIGUSR1 to try to get the VM to create a crash dump. Since this can take a while, heart then sleeps for 4 seconds. You might have to increase this sleep time if you're not getting full crash dumps. After that, heart tries twice to issue a SIGQUIT in the hope of getting a coredump, and if that fails, issues a SIGKILL.

Note that this patch will slow down heart's VM restart due to the time required to wait for the crash dumps and coredumps. If you use it in production, be aware of this limitation.
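Before relying on the patched heart, it can be worth confirming on a throwaway test node that your VM really does write a crash dump when it receives SIGUSR1. A minimal sketch, assuming a Unix-like system (the call terminates the node):

%% Send SIGUSR1 to this VM; the emulator should write a crash dump and exit.
OsPid = os:getpid(),                     %% OS pid of the running VM, as a string
os:cmd("kill -USR1 " ++ OsPid).
%% Look for erl_crash.dump in the node's current directory, or at the
%% path given by the ERL_CRASH_DUMP environment variable if you set one.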

Steve Vinoski

If you have any idea of why it is freezing, you could try to trace the module in question using dbg.

http://www.erlang.org/doc/man/dbg.html

In short, try

dbg:tracer(), dbg:p(all,c), dbg:tpl(Module, Function, x).

To stop this tracing, issue

dbg:ctpl()

See documentation for more info.

Note: change Module and Function to whatever you want to trace, and leave x as it is. You can also skip Function and give only Module and x.

Warning: running this on a live system can be dangerous, as the amount of information printed to the shell can be enormous.
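If the shell output is the concern, the trace messages can instead be written to a binary file and examined later. A sketch using dbg's file trace port (the file name is arbitrary):

%% Trace to a file instead of the shell.
dbg:tracer(port, dbg:trace_port(file, "node_trace.log")),
dbg:p(all, c),
dbg:tpl(Module, Function, x).
%% Later, read the trace back with:
dbg:trace_client(file, "node_trace.log").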

Sedrik
    Unfortunately, this won't work as I can't predict _when_ it'll freeze. – Alexey Romanov Apr 13 '11 at 14:52
    You can set up `dbg` to trace to a file. Might work letting it run for a while and only examine the traces once you detect the failure scenario. – Adam Lindberg Apr 14 '11 at 07:17
  • Another possible reason for it to hang is if you have any receive clauses in your code that do not include timeouts. Look for those, add a timeout clause to them, and log the error (a small sketch follows these comments). – Sedrik Apr 15 '11 at 09:21
  • Wouldn't that only hang one process, and not the entire node? – Alexey Romanov Apr 15 '11 at 10:47
  • That's true, but how do you know that the node has hung then? What are your symptoms? – Sedrik Apr 15 '11 at 14:15
  • The symptom is that `heart` is restarting the node :) So apparently it isn't responding to messages. I later see these restarts in the logs, along with an absence of other records for some time before the restarts. – Alexey Romanov Apr 17 '11 at 19:56
  • Are your logs overwritten at each restart then? In one of my clients' systems I have instructed the heart command to make a copy of my SASL log when it restarts. – Sedrik Apr 18 '11 at 09:51
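Following up on the comment about receive clauses without timeouts, here is a small hypothetical sketch of adding an after clause so a missing reply gets logged instead of blocking the process forever (the names and the 5-second limit are illustrative, not from the thread):

wait_for_reply(Ref) ->
    receive
        {Ref, Reply} ->
            {ok, Reply}
    after 5000 ->
            %% Log instead of hanging silently when the reply never arrives.
            error_logger:error_msg("Timed out waiting for reply to ~p~n", [Ref]),
            {error, timeout}
    end.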

You could try to call erlang:halt/1 from your HEART_COMMAND, thus creating a crash dump from the unresponsive node.

You can try using the erl_call tool with e.g. -a erlang halt 123.

If the Erlang node can't even respond to this, that is also interesting information.

Did you try increasing `HEART_BEAT_TIMEOUT`? Maybe the node is just bogged down a bit and misses the timeout rather than actually freezing.
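One detail worth keeping in mind: erlang:halt/1 only produces a crash dump when its argument is a string, which becomes the dump's slogan; a plain integer such as 123 just sets the exit status. So the call you trigger remotely might look like this (the slogan text is arbitrary):

%% Writes erl_crash.dump with the given slogan and then exits with status 1.
erlang:halt("heart: node stopped responding").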

Peer Stritzinger