
I have to migrate a C program from OpenVMS to Linux, and am now having difficulties with a part that spawns subprocesses. A subprocess is created (fork works fine), but execve fails (which is expected, as the wrong program name is given).

But to reset the number of active subprocesses, I afterwards call a wait() which does not return. When I look at the process via ps, I see that there are no more subprocesses, but wait() does not return ECHILD as I had thought.

while (jobs_to_be_done)
{
   if (running_process_cnt < max_process_cnt)
   {
      if ((pid = vfork()) == 0)
      {
         params[0] = param1 ;
         params[1] = NULL ;
         if ((cstatus = execv(command, params)) == -1)
         {
            perror("Child - Exec failed") ;   // this happens
            exit(EXIT_FAILURE) ;
         }
      }
      else if (pid < 0)
      {
         printf("\nMain - Child process failed") ;
      }
      else
      {
         running_process_cnt++ ;
      }
   }
   else   // no more free process slot, wait
   {
      if ((pid = wait(&cstatus)) == -1)   // does not return from this statement
      {
         if (errno != ECHILD)
         {
            perror("Main: Wait failed") ;
         }
         anz_sub = 0 ;
      }
      else
      {
         ...
      }
   }
}

Is there anything that has to be done to tell wait() that there are no more subprocesses? On OpenVMS the program works fine.

Thanks a lot in advance for your help

Paul R
  • if (as you say in comments on answers) you changed `vfork` to `fork` and still get the problem, I'd say that it's most likely that you are misinterpreting what is happening. Try reducing this to a minimal, complete, compilable, verifiable example (http://stackoverflow.com/help/MCVE). – davmac Jul 21 '15 at 10:55

2 Answers


I don't recommend using vfork these days on Linux, since fork(2) is efficient enough, thanks to lazy copy-on-write techniques in the Linux kernel.

You should check the result of fork. Unless it is failing, a process has been created, and wait (or waitpid(2), perhaps with WNOHANG if you don't want to really wait, but just find out about already ended child processes ...) should not fail (even if the exec function in the child has failed, the fork did succeed).

You might also carefully use the SIGCHLD signal, see signal(7). A defensive way of using signals is to set some volatile sig_atomic_t flag in signal handlers, and test and clear these flags inside your loop. Recall that only async-signal-safe functions (and there are quite few of them) can be called, even indirectly, inside a signal handler. Read also about POSIX signals.
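The flag approach can be sketched roughly as follows. This is only a minimal illustration: the names child_exited, on_sigchld, install_sigchld_handler and take_child_exited_flag are made up for this example and do not come from the question's code.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, tested and cleared by the main loop. */
static volatile sig_atomic_t child_exited = 0;

/* The handler only sets a flag; it calls nothing that is not
 * async-signal-safe. */
static void on_sigchld(int signo)
{
    (void)signo;
    child_exited = 1;
}

/* Install the handler with sigaction(2). Returns 0 on success, -1 on error. */
int install_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    /* Restart interrupted system calls; ignore stopped children. */
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Returns 1 (and clears the flag) if a SIGCHLD arrived since the last call,
 * otherwise 0. The flag is only cleared when it was seen set, so a signal
 * arriving after the check is not lost. */
int take_child_exited_flag(void)
{
    if (child_exited) {
        child_exited = 0;
        return 1;
    }
    return 0;
}
```

The main loop would call take_child_exited_flag() on each iteration and, when it returns 1, reap children with waitpid. Several children may terminate for a single delivered signal, so the reaping itself should loop until no more finished children are reported.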

Take time to read Advanced Linux Programming to get a wider picture in your mind. Don't try to mimic OpenVMS on POSIX, but think in a POSIX or Linux way!

You probably want to call waitpid in every iteration of your loop, perhaps (sometimes or always) with WNOHANG, and not only in the else part of your if (running_process_cnt < max_process_cnt).
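A non-blocking reaping step might look like the sketch below. The function name reap_finished_children is hypothetical; in the question's loop, its return value would be subtracted from running_process_cnt on every iteration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Reap every child that has already terminated, without blocking.
 * Returns the number of children reaped (0 if none were finished
 * or if there are no children at all). */
int reap_finished_children(void)
{
    int reaped = 0;
    int status;
    pid_t pid;

    for (;;) {
        pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            reaped++;          /* one terminated child collected */
            continue;
        }
        if (pid == 0)
            break;             /* children exist, but none has finished yet */
        if (errno == EINTR)
            continue;          /* interrupted by a signal, try again */
        break;                 /* ECHILD (no children) or another error */
    }
    return reaped;
}
```

When all slots are busy, the loop can still fall back to a blocking wait or waitpid so that it does not spin.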

You might want to compile with all warnings and debug info (gcc -Wall -Wextra -g), then use the gdb debugger. You could also strace(1) your program (probably with -f).

You might want to learn about memory overcommitment. I dislike this feature and usually disable it (e.g. by running echo 2 > /proc/sys/vm/overcommit_memory as root; note that 0 selects the default heuristic mode and does not turn overcommit off). See also proc(5), which is very useful to know about.

Basile Starynkevitch
  • Hello Basile Thank you for your answer. I replaced vfork() with fork(), but the result was still the same. I then put some sleep-commands into the job, and saw, that fork() worked fine, the execv() works as well, but after the subprocesses terminate, wait() does not return. – Jörg Mohren Jul 21 '15 at 08:04
  • I'd say `posix_spawn` is the better option here, if you can get it. `fork`/`exec` is unfortunately unreliable on common systems, especially `Cygwin` and most (non-Linux) POSIX implementations not allowing virtual memory to be over-committed. (Off-topic but I've been bitten in the past.) – doynax Jul 22 '15 at 12:14
  • But the question is about Linux.... When is `fork/exec` "unreliable"? Of course, it can fail. – Basile Starynkevitch Jul 22 '15 at 12:42
  • @Basile Starynkevitch: On Linux? As far as I'm aware `fork+exec` is close to identical to `vfork+exec` there, assuming no illegal `vfork` behavior. Except for a bit of time copying the virtual memory metadata. Where I've personally run into problems is on `Cygwin`, which uses some nasty and deeply unreliable hacks to fake `fork` at the user level, as well as an old `Solaris` server which would occasionally run out of memory while forking (the system promised to back up the forked memory with swap space in case it should be needed, as opposed to the `Linux` roulette.) – doynax Jul 22 '15 at 17:27

From man vfork:

The child must not return from the current function or call exit(3), but may call _exit(2)

You must not call exit() when the call to execv (after vfork) fails - you must use _exit() instead. It is quite possible that this alone is causing the problem you see with wait not returning.

I suggest you use fork instead of vfork. It's much easier and safer to use.
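As a rough sketch of that advice, the following helper uses fork, calls _exit (not exit) when execv fails, and waits for that specific child. The name run_and_wait and the exit code 127 are choices made for this illustration (127 mirrors the shell convention for "command not found"); neither comes from the question.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, exec `command` with `argv`, and wait for that child.
 * Returns the child's exit code, or -1 if fork/waitpid failed or the
 * child did not exit normally (e.g. was killed by a signal). */
int run_and_wait(const char *command, char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return -1;
    }
    if (pid == 0) {
        execv(command, argv);
        /* Only reached if execv failed. */
        perror("execv");
        _exit(127);            /* _exit: skip atexit handlers and stdio flushing */
    }

    /* Parent: wait for this particular child, retrying on EINTR. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR) {
            perror("waitpid");
            return -1;
        }
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with a wrong program name, as in the question, should return 127 rather than hang, provided SIGCHLD has its default disposition in the calling process.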

If that alone doesn't solve the problem, you need to do some debugging or reduce the code down until you find the cause. For example the following should run without hanging:

#include <sys/wait.h>

int main(int argc, char ** argv)
{
    pid_t pid;
    int cstatus;
    pid = wait(&cstatus);
    return 0;
}

If you can verify that this program doesn't hang, then it must be some aspect of your program that is causing a hang. I suggest putting in print statements just before and after the call to wait.

davmac
  • Hello, I replaced the exit() statements in the subprocess with _exit(), but that did not work either. I then replaced the wait() with a loop of waitpid() calls for the actual subprocesses. At least these statements return, but all with return value 0 and errno ECHILD. The subprocesses are not yet finished at the time waitpid() is called. – Jörg Mohren Jul 21 '15 at 11:15
  • So my problem is that waitpid(pid, &cstatus, 0) returns, but does not deliver the pid of the process terminated nor the correct exit status of the subprocess. – Jörg Mohren Jul 21 '15 at 11:28
  • @JörgMohren see example in comment above. Does it work for you? I suggest printing the child pid immediately after `fork` and also before you `wait`/`waitpid`. Maybe there's a memory corruption problem elsewhere in your program. Or, you are waiting twice for the same child process. – davmac Jul 21 '15 at 11:43
  • Hello davmac Thanks a lot; your version works for me as well. Now I only have to find out which of the parts in my more complex program causes the problem. But this should be possible! – Jörg Mohren Jul 21 '15 at 11:49
  • Hello I think I have found the point where it goes wrong. I copied davmac's code into a function in my code, calling it from several positions. Everything goes fine up to the moment when I connect my program to the underlying Oracle database. After the connect, it does not work anymore. I shall present this problem to Oracle (if none of you has a good idea what the reason might be) – Jörg Mohren Jul 21 '15 at 12:51
  • Hmm, I wonder if the Oracle client library both (a) runs a persistent child process and (b) establishes a SIGCHLD signal handler or runs a thread that calls `wait`? – davmac Jul 21 '15 at 12:53
  • (or maybe it sets the SIGCHLD handler to SIG_IGN?) – davmac Jul 21 '15 at 12:55
  • I have now found a possible solution to the problem. If I call waitpid(0, &cstatus, 0), the program only waits for the subprocesses I generated (i.e., the subprocess generated by the DB connect is in another process group). But, what I find a bit surprising, waitpid() waits, but then fails with ECHILD (though I saw that it really waits until the end of the subprocess). I could also work out the return state by checking whether any errors have occurred, but this does not cover system problems that made the process crash. Any ideas on this? Besides, I tried a signal(SIGCHLD, SIG_DFL) before the subprocess handling. Didn't help. – Jörg Mohren Jul 22 '15 at 06:26
  • Sorry, my fault. The command 'signal(SIGCHLD, SIG_DFL)' DID help (I used the wrong executable for my test). Thanks a lot to davmac for your great idea. – Jörg Mohren Jul 22 '15 at 06:32
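
The behaviour described in the last comments matches what Linux documents for an ignored SIGCHLD: when the disposition is SIG_IGN, terminated children are not kept as zombies, and wait() blocks until all children have terminated and then fails with ECHILD. Whether the Oracle client library really sets SIG_IGN is only the guess made in this thread; the sketch below just reproduces the effect in isolation. The helper name wait_with_sigchld and the exit code 7 are made up for the illustration.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: set the SIGCHLD disposition, fork a child that
 * exits with code 7, then call wait().
 * Returns the child's exit code, or -1 if wait() failed; in that case
 * *wait_errno holds the errno of the failed call. */
int wait_with_sigchld(void (*disposition)(int), int *wait_errno)
{
    int status = 0;
    pid_t pid, reaped;

    *wait_errno = 0;
    if (signal(SIGCHLD, disposition) == SIG_ERR) {
        *wait_errno = errno;
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        *wait_errno = errno;
        return -1;
    }
    if (pid == 0)
        _exit(7);              /* child: terminate immediately */

    /* Parent: wait for any child, retrying if interrupted by a signal. */
    do {
        reaped = wait(&status);
    } while (reaped < 0 && errno == EINTR);
    if (reaped < 0)
        *wait_errno = errno;

    signal(SIGCHLD, SIG_DFL);  /* restore the default for later callers */

    if (reaped == pid && WIFEXITED(status))
        return WEXITSTATUS(status);
    return -1;
}
```

With SIG_IGN the helper should return -1 with *wait_errno set to ECHILD, and with SIG_DFL it should return 7. That is consistent with calling signal(SIGCHLD, SIG_DFL) before the subprocess handling, after the database connect, making the exit status available again.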