3

I am trying to use waitpid() to wait for individual threads instead of processes. I know that pthread_join() or std::thread::join() are the typical ways of waiting for a thread. In my case, however, I am developing a monitoring application that forks and executes (via execv) a program which, in turn, spawns some threads. I cannot join the threads from the monitoring application, since they belong to a different process and I do not have access to the source code. Still, I want to be able to wait for these individual threads to finish.

For an easier visualization of what I am trying to achieve, I include a drawing, hoping to make it much more clear:

enter image description here

Everything works fine when I use processes, but waitpid does not wait for threads. Basically, waitpid returns -1 right after it is called (at that point the thread still keeps running for a few more seconds).

Documentation for waitpid states:

In the Linux kernel, a kernel-scheduled thread is not a distinct construct from a process. Instead, a thread is simply a process that is created using the Linux-unique clone(2) system call; other routines such as the portable pthread_create(3) call are implemented using clone(2). Before Linux 2.4, a thread was just a special case of a process, and as a consequence one thread could not wait on the children of another thread, even when the latter belongs to the same thread group. However, POSIX prescribes such functionality, and since Linux 2.4 a thread can, and by default will, wait on children of other threads in the same thread group.

That description only covers a thread waiting for the children of other threads in the same thread group (in my case, I want to wait for threads that belong to another process). But, at least, it shows that waitpid is thread-aware.

This is what I am using for waiting for the threads:

std::vector<pid_t> pids;

/* fill vector with thread IDs (LWP IDs) */

for (pid_t pid : pids) {
    int status;
    pid_t res = waitpid(pid, &status, __WALL);
    std::cout << "waitpid rc: " << res << std::endl;
}

This code works for waiting for processes, but it fails for waiting for threads (even if the __WALL flag is used).

I am wondering whether it is actually possible to wait for a thread by using waitpid. Is there any other flag that I need to use? Could you point me to any document where it is explained how to wait for threads of another process?

For reference, the code that I am using for creating the threads is:

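// Shared state; these declarations are not shown in the original snippet and are inferred from usage.
static std::mutex mutex;
static std::condition_variable pids_ready;
static std::vector<pid_t> pids;
static std::vector<std::thread> threads;

static void foo(int seconds) {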
    int tid;
    {
        std::lock_guard<std::mutex> lock(mutex);
        tid = syscall(__NR_gettid);
        std::cout << "Thread " << tid << " is running\n";
        pids.push_back(tid);
        pids_ready.notify_all();
    }

    for (int i = 0; i < seconds; i++)
        std::this_thread::sleep_for(std::chrono::seconds(1));
}

static void create_thread(int seconds) {
    std::thread t(foo, seconds);
    threads.push_back(std::move(t));
}

std::vector<pid_t> create_threads(int num, int seconds) {
    for (int i = 0; i < num; i++)
        create_thread(seconds);

    std::unique_lock<std::mutex> lock(mutex);
    pids_ready.wait(lock, [num]() { return pids.size() == num; });

    return pids;
}

I am using GCC 4.6 and Ubuntu 12.04.

UPDATE: I managed to make it work by using ptrace:

ptrace(PTRACE_ATTACH, tid, NULL, NULL);
waitpid(tid, &status, __WALL);
ptrace(PTRACE_CONT, tid, NULL, NULL);

while (true) {
    waitpid(tid, &status, __WALL);
    if (WIFEXITED(status)) // assume it will exit at some point
        break;
    ptrace(PTRACE_CONT, tid, NULL, NULL);
}

This code works both when T1, T2, ..., Tn are processes and when they are threads.

I have an issue, however. If I try this monitoring tool with multithreaded C++ applications, everything works fine. But the original intent was to use this monitoring tool with a Java application spawning several threads. When using a multithreaded Java application the waitpid in the loop wakes up many times per second (the child thread is stopped by a SIGSEGV signal). This seems to be related to the fact that Java is using SIGSEGV for its own purposes (see this question, and this post).

All those wake-ups end up slowing down the application a lot. So I am wondering whether there is some flaw in my solution and whether there is a way to make it work with Java applications.

betabandido
  • Although not familiar with using waitpid() to wait for threads to terminate, I'd interpret the quote from man 2 waitpid you posted in such a way that one could use waitpid(..., __WALL) **from within the process that created the threads** one wants to monitor, as they are children of this creating process (main thread). And as `waitpid()` only waits for children and not for grandchildren, I assume you're on the wrong trail. – alk Jul 02 '12 at 14:45
  • @alk Yes, you may actually be right. In the `waitpid` man page, however, it states `The following Linux-specific options are for use with children created using clone(2)...` (referring to `__WALL`, etc.). So, somehow, it may seem like it should be possible to wait for a thread. Anyway, I will keep looking for a solution, while hoping someone has already done this before and posts a solution :) – betabandido Jul 02 '12 at 14:54
  • Why not test the `/proc/PID/task/TID/` scanning approach I outlined? It is Linux-only, and you don't get notifications, but scanning `/proc/PID/task/` is a very lightweight operation. Do you want an example? – Nominal Animal Jul 04 '12 at 21:23

5 Answers

3

I'm a bit confused about your claim that everything "works fine" for processes. waitpid can only wait for your own child processes, not arbitrary other processes, and in fact it's almost surely a bug to ever use a process id except when it's your own child process.

Rather than looking for ugly hacks to do something that's not intended to be possible, why not just fix your design to use some proper inter-process communication mechanism so that threads can signal to the other process when they're done? Or put the whole program in a single process (with multiple threads) rather than splitting your work across multiple processes and threads?

R.. GitHub STOP HELPING ICE
  • I updated the question with a plot, hoping to make my point clearer. As you can see, my monitoring application executes another program. That program then creates threads and I want to wait for them. I cannot modify that program, so I cannot do what you are suggesting (which I agree would be the right thing to do). If instead of a program that spawns threads, I monitor a program that spawns processes, then the `waitpid` works (i.e., it waits until the child processes finish). – betabandido Jul 02 '12 at 14:20
  • It sounds then like you want to use `ptrace`, but this will have considerable runtime overhead. – R.. GitHub STOP HELPING ICE Jul 02 '12 at 14:21
  • I have used `ptrace` just once or twice, so I am not that familiar with it. Do you know whether that runtime overhead will only occur during thread creation/finishing or it will impact the total execution time for all the threads? – betabandido Jul 02 '12 at 14:24
  • Actually, I used `ptrace` in the past in order to attach to processes that were already running in the system before my monitoring software started. In that way, I could wait for them to finish. Do I need to use `ptrace` too if I need to wait for processes/threads that are "grandchildren" of my monitoring application? For processes, it seems `waitpid` suffices, but I cannot make it work for threads. – betabandido Jul 02 '12 at 14:27
  • Yes. The only reason `ptrace` allowed you to wait for the children of the process you traced is that `ptrace` allows you to act on behalf of the process you're tracing. When you do this, of course, you need to be sure not to interfere with the process's waiting for its own children; I'm not sure what's involved in that. I think it would be a lot safer to trace *everything* you want and get the trace notification that a process/thread has exited than to call `waitpid`. – R.. GitHub STOP HELPING ICE Jul 02 '12 at 14:33
3

You cannot wait on threads in other processes in Linux, except on the thread group leader (a.k.a. the main thread).

sys_waitpid in modern Linux kernels is implemented as a wrapper around sys_wait4, which in turn calls do_wait. do_wait does the heavy lifting of waiting on processes (threads are just a special kind of process). It only iterates over the known children of the current task and, if __WNOTHREAD is NOT specified, over the children of the other threads in the same thread group.

The funny part here is that creating a thread using the clone syscall actually sets the parent of the newly created thread to the parent of the process that was cloned, but this parent is in no way notified that it has just acquired a new child (the thread is not registered in the lists of the parent's task structure). The parent will also not receive SIGCHLD when the clone exits, since the exit signal of threads is set to -1 by copy_process, the function that actually copies processes.

The rationale behind this is quite simple: waiting is a single shot operation - once a wait has been performed and completed, the waited process is no longer existent. If you allow for another process to wait on a thread or a child of the current process, you take from the current one the ability to perform the wait on its children. You also create a possible race condition and would definitely not enjoy pthread_join() failing because some other process has waited on one of your threads, would you?

Hristo Iliev
1

Ok, this is not a solution, but an explanation why I doubt there is a solution using waitpid():

1.1 Under Linux threads created using clone() are children of the process having created them.

1.2 Following this, threads are grand-children of a process (A) that created a process (B) which in turn had created the threads.

2 waitpid() does not trigger on a signal SIGCHLD for any terminated grand-child.

All this together explains why your approach does not work.
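The point about grandchildren can be checked with plain processes. What follows is a minimal sketch under stated assumptions, not code from the question: the names `WaitDemo` and `wait_on_grandchild` are made up for illustration. Process A forks B, B forks C, and A then calls waitpid on C's PID while C is still alive.

```cpp
#include <cerrno>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical result type for this demo.
struct WaitDemo {
    pid_t grandchild_rc;   // return value of waitpid() on the grandchild C
    int grandchild_errno;  // errno right after that call
    pid_t child;           // PID of the direct child B
    pid_t child_rc;        // return value of waitpid() on B
};

// A forks B, B forks C. A then tries to wait for C and afterwards for B.
WaitDemo wait_on_grandchild() {
    const WaitDemo failed = {-1, 0, -1, -1};
    int report[2];   // B -> A: carries C's PID
    int release[2];  // A -> C: C exits once every write end is closed
    if (pipe(report) == -1 || pipe(release) == -1) return failed;

    pid_t b = fork();
    if (b == -1) return failed;
    if (b == 0) {                                   // process B
        close(report[0]);
        pid_t c = fork();
        if (c == 0) {                               // process C, grandchild of A
            close(report[1]);
            close(release[1]);
            char byte;
            ssize_t n = read(release[0], &byte, 1); // blocks until A lets go
            _exit(n == 0 ? 0 : 1);
        }
        close(release[0]);
        close(release[1]);
        ssize_t n = write(report[1], &c, sizeof c);
        if (c > 0) waitpid(c, nullptr, 0);          // only B can reap C
        _exit(n == static_cast<ssize_t>(sizeof c) ? 0 : 1);
    }

    // process A
    close(report[1]);
    close(release[0]);
    pid_t c = -1;
    ssize_t n = read(report[0], &c, sizeof c);
    close(report[0]);

    WaitDemo result = failed;
    result.child = b;
    int status = 0;
    if (n == static_cast<ssize_t>(sizeof c) && c > 0) {
        errno = 0;
        // C is still running here, but it is not a child of A.
        result.grandchild_rc = waitpid(c, &status, __WALL);
        result.grandchild_errno = errno;
    }
    close(release[1]);                              // let C (and then B) exit
    result.child_rc = waitpid(b, &status, 0);
    return result;
}
```

On Linux, the call on C is expected to return -1 immediately with errno set to ECHILD, while the call on B blocks until B exits and then returns B's PID.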

alk
0

In Linux, you can monitor the /proc/PID/task/ directory, which contains a directory for each thread belonging to process PID.

Unfortunately the inotify interface does not seem to help here, so you'd have to repeatedly scan the /proc/PID/task/ directory for thread IDs. Fortunately, that does seem to be minimal cost, especially if you only do the scan a dozen or at most a few dozen times a second. Note that the directory will vanish when the thread exits, not when the thread is reaped.

The one thread with TID==PID is the original process in Linux. Other threads will get TIDs in increasing order (although they will wrap around eventually, of course). Note that TIDs have no relation to pthreads threads. To find out which TID would map to which pthread_t, the running thread would have to call gettid() (in practice, syscall(SYS_gettid)); otherwise it is very difficult to tell which thread is which based on TID or /proc/PID/task/TID/ contents alone. If you are only interested in thread turnover (if/when created and/or exited), then this interface is a lot more efficient than e.g. ptrace, although there is a latency to the thread exit detection (which depends on your directory scanning interval).

Nominal Animal
0

As far as I know, waitpid is only used to handle a specific terminated child process. It is also safer than wait when many child processes have to be handled at the same time.

mr-yu