4

I'm working on a project that requires executing some processes inside a Docker container, and I want to handle the case where a process doesn't terminate in time (let's say within 10 seconds).

I'm using the DockerClientpp library to manage the containers; it basically just makes HTTP requests to the Docker socket. Everything is fine up to this point.

To stop a container that is taking too long, I use a separate thread. The problem is that I was able to implement this with pthreads, but I cannot find a way to do it with std::thread and lambdas.

Here is my working implementation with pthreads:

void *ContainerManager::spawnKiller(void *ref) {
    ContainerManager *self = (ContainerManager *)ref;
    std::unique_ptr<DockerClientpp::DockerClient> dc(new DockerClientpp::DockerClient());

    std::cout << "[slave]forceStop(): Waiting " << self->timeOut << " before stopping " << self->activeId << std::endl;
    std::this_thread::sleep_for(std::chrono::seconds(self->timeOut));
    try {
        dc->stopContainer(self->activeId);
        std::cout << "[slave]forceStop(): Container will be force-stopped" << std::endl;
    } catch(std::exception &e) {
        // container has already been destroyed
        std::cout << "[slave]forceStop(): Error => " << e.what() << std::endl;
    }
    pthread_exit(0);
}

void ContainerManager::execute() {
    pthread_t killerId;
    pthread_create(&killerId, nullptr, &(ContainerManager::spawnKiller), (void *)this);
    pthread_detach(killerId);
}

And here is my std::thread and lambda implementation, which fails with a SEGFAULT as soon as I try to detach the thread:

void ContainerManager::execute() {
    std::thread([this]() {
        std::this_thread::sleep_for(std::chrono::seconds(timeOut));
        try {
            dc->stopContainer(activeId);
            std::cout << "[slave]forceStop(): Container will be force-stopped" << std::endl;
        } catch(std::exception &e) {
            // container has already been destroyed
            std::cout << "[slave]forceStop(): Error => " << e.what() << std::endl;
        }
    }).detach();
}

And this is what gdb shows:

Thread 1 "test" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00000000007c6801 in std::thread::detach() ()
#2  0x0000000000410785 in ContainerManager::execute (this=0x7fffffffe2a0, processName=...)
    at ../container_manager.cpp:223
#3  0x0000000000412c99 in ContainerManager::executeNew (this=0x7fffffffe2a0, processName=..., 
    replace=false, language=@0x7fffffffe020: ContainerManager::GO) at ../container_manager.cpp:336
#4  0x00000000004094a9 in main () at test.cpp:36

I tried with a regular function instead of a lambda, I tried capturing the parameters, and I also tried passing the parameters as arguments, but I'm stuck.

I haven't tried allocating the thread dynamically with new std::thread(...), but from my understanding, once the thread has been detached it keeps running even if the std::thread variable goes out of scope.

Do you have any suggestions about what I'm doing wrong? I feel like I'm really missing something about std::thread and lambdas.

The execute method is a member of the ContainerManager class, which is guaranteed not to go out of scope before the spawned thread has terminated; the variables I use (timeOut and activeId) are fields of that object.


EDIT: It really seems there is something wrong with detach()

If I run this:

void ContainerManager::execute() {
    int *t = new int;
    *t = timeOut;
    std::string *s = new std::string;
    *s = activeId;
    std::thread x([&t, &s]() {
        std::cout << "LOL" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(*t));
        std::unique_ptr<DockerClientpp::DockerClient> _dc(new DockerClientpp::DockerClient());
        try {
            _dc->stopContainer(*s);
            std::cout << "[slave]forceStop(): Container will be force-stopped" << std::endl;
        } catch(std::exception &e) {
            // container has already been destroyed
            std::cout << "[slave]forceStop(): Error => " << e.what() << std::endl;
        }
    });
    std::cout << "Detaching" << std::endl;
    if(x.joinable()) {
        std::cout << ".. in a moment" << std::endl;                                                                             
        x.detach();
    }
}

I get this output:

Detaching
.. in a moment
Segmentation fault (core dumped)

EDIT 2: I tried running this code on my laptop and everything works fine:

void ContainerManager::execute() {
    // activeId and timeOut are fields of the ContainerManager object
    std::thread([this]() {
        std::this_thread::sleep_for(std::chrono::seconds(timeOut));
        std::unique_ptr<DockerClientpp::DockerClient> dc(new DockerClientpp::DockerClient());
        try {
            dc->stopContainer(activeId);
            std::cout << "[slave]forceStop(): Container will be force-stopped" << std::endl;
        } catch(std::exception &e) {
            // container has already been destroyed
            std::cout << "[slave]forceStop(): Error => " << e.what() << std::endl;
        }
    }).detach();
}
fedemengo
  • You're capturing the `this` pointer. Are you sure that object isn't going out of scope? – Cruz Jean Jul 19 '19 at 05:18
  • @CruzJean Yeah, I also tried just capturing **timeOut** and **activeId** but I have the same problem – fedemengo Jul 19 '19 at 05:21
  • What is the lifetime of `dc` in the second case? You should show the second version of `spawnKiller` as well. – rafix07 Jul 19 '19 at 05:24
  • @CruzJean the thread doesn’t execute a single line of the function, I get segfault right away – fedemengo Jul 19 '19 at 05:26
  • @rafix07 it's a field of the ContainerManager (the captured **this**), which doesn't go out of scope – fedemengo Jul 19 '19 at 05:27
  • @fedemengo did you already try running with the sanitizer enabled? (also all warnings) – Cruz Jean Jul 19 '19 at 05:35
  • I don't see `dc` getting allocated at all in the lambda version, whereas in the pthreads version, it gets specifically allocated by the thread. – selbie Jul 19 '19 at 05:46
  • The `ContainerManager::execute()` using `std::thread` seems to be missing a `}`. Are you sure there's not something else important missing from the posted code? – Michael Burr Jul 19 '19 at 06:22
  • @selbie it's a field member of the class that the thread should be able to use by capturing `this` – fedemengo Jul 19 '19 at 14:53
  • @MichaelBurr yeah, the code is exactly the one I'm running, I missed the `}` when formatting here – fedemengo Jul 19 '19 at 14:54
  • @rafix07 I'm using `-fsanitize=address` and `-Wall -Werror` and g++ doesn't report any problem – fedemengo Jul 20 '19 at 01:17
  • Actually, your question is off-topic because it lacks a [mcve]. Just as a piece of advice, also get a book about C++. Your use of `new` is very non-idiomatic and will sooner or later cause problems. Oh, and your pthread version has a memory leak as well, AFAICT. – Ulrich Eckhardt Jul 21 '19 at 10:13
  • What's the memory leak? The last approach with `new` was just a desperate attempt to find a solution – fedemengo Jul 21 '19 at 15:23
  • @UlrichEckhardt is correct about the memory leak in the pthread version. Since `pthread_exit()` is called to terminate the thread, the destructor for `dc` won't be called to free that object. You should instead simply `return` from the thread routine. – Michael Burr Jul 22 '19 at 04:54
  • @MichaelBurr oh I didn't know that, thanks! – fedemengo Jul 22 '19 at 05:13
  • I tested the exact same code that I was using on another machine and everything works just fine. Still a mystery why it fails on the other – fedemengo Jul 22 '19 at 04:03
  • Just out of curiosity, what compiler toolchain and target platform are you using? – Michael Burr Jul 22 '19 at 04:48
  • @MichaelBurr It fails on an AWS Ubuntu machine with `gcc version 7.4.0 (Ubuntu 7.4.0-1ubuntu1~18.04.1)` and it works on my laptop with Manjaro and `gcc version 9.1.0 (GCC)`. I guess it could be because of the older gcc!? – fedemengo Jul 22 '19 at 05:17
  • It is quite common for a multithreaded program to show an error on one machine but not on another. Sometimes all it takes is a tiny difference in the speed of execution and the threads synchronize differently. – Piotr Siupa Jul 22 '19 at 07:51
  • @NO_NAME so do you think there is still a bug in the code since it fails on the other machine? – fedemengo Jul 22 '19 at 19:48
  • @fedemengo might also be a compiler bug, have you tried other (simpler) examples with `detach` on the AWS machine? – Mike van Dyke Jul 23 '19 at 08:19
  • You say you get the same problem with @MikevanDyke's code - could you update your question to show the actual code you're running? Mike fixed a real bug, but if something else is wrong, it would be helpful to see exactly what you have. – Useless Jul 24 '19 at 13:08
  • @Useless I just updated the question – fedemengo Jul 24 '19 at 18:35
  • Right, so do you have a version _without_ the error Mike fixed, but which _does_ still crash in `detach`? Your comment suggests you do, but the edit doesn't confirm it. If Mike's answer fixes your problem, then that's the answer. If it doesn't, you're still not showing us how to reproduce your issue. – Useless Jul 25 '19 at 08:34
  • The version Mike fixed is one of the many versions I tried to run; apparently the problem has something to do with the machine. I ran the code that I posted in the last edit and it gives me a SEGFAULT on AWS but works correctly on my laptop. I thought the SEGFAULT was due to the captured objects going out of scope, that's why I tried with variables allocated on the heap. But that wasn't my first choice (regarding the implementation) and it turned out it didn't fix the problem I was having. If I try Mike's fix on AWS I still get a segfault – fedemengo Jul 25 '19 at 14:57
  • How are you launching the processes in containers? If you're doing `docker run -d ubuntu:14.04 /my/process`, wouldn't it work to replace it with `docker run -d ubuntu:14.04 timeout 10s /my/process`? It even returns error code 124 to tell you that the process timed out. – root Jul 27 '19 at 18:29
  • @root I guess that would work, I just wanted to handle everything with the Docker engine but that is definitely a valid alternative. I also tried with the Docker container flags (`--stop-timeout 10`) but that never worked – fedemengo Jul 27 '19 at 19:20
  • Depending on your design, you need to consider whether handling everything with the Docker engine is worth adding a thread, ~20 lines of non-trivial C++ code in the solution below, plus whatever error handling you're missing from `stopContainer()` and the `DockerClientpp::DockerClient` constructor. – root Jul 27 '19 at 19:27
  • Definitely, I decided to go with your suggestions in the end. Thanks! – fedemengo Jul 28 '19 at 06:30

2 Answers

3

In the thread, you are accessing references to the variables int *t and std::string *s, which are local to the ContainerManager::execute() method. As soon as ContainerManager::execute() finishes, any access to the two variables causes undefined behaviour and, in your case, the SEGFAULT. Instead, pass the two pointers by value to the lambda (and even better: don't use new at all; a sketch of that variant follows after the code below):

void ContainerManager::execute() {
    int *t = new int;
    *t = timeOut;
    std::string *s = new std::string;
    *s = activeId;
    std::thread x([t, s]() { // <<--- Pass by value
        std::cout << "LOL" << std::endl;
        std::this_thread::sleep_for(std::chrono::seconds(*t));
        std::unique_ptr<DockerClientpp::DockerClient> _dc(new DockerClientpp::DockerClient());
        try {
            _dc->stopContainer(*s);
            std::cout << "[slave]forceStop(): Container will be force-stopped" << std::endl;
        } catch(std::exception &e) {
            // container has already been destroyed
            std::cout << "[slave]forceStop(): Error => " << e.what() << std::endl;
        }
    });
    std::cout << "Detaching" << std::endl;
    if(x.joinable()) {
        std::cout << ".. in a moment" << std::endl;                                                                             
        x.detach();
    }
}
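
As a rough sketch of the "don't use new at all" variant (using the same timeOut and activeId fields and DockerClientpp calls as in the question; includes omitted, as in the other snippets), the lambda can simply own plain copies:

void ContainerManager::execute() {
    // Plain local copies: the lambda captures them by value, so the
    // detached thread owns its own data and nothing can dangle.
    int t = timeOut;
    std::string s = activeId;
    std::thread([t, s]() {
        std::this_thread::sleep_for(std::chrono::seconds(t));
        // each thread creates its own client, as in the pthread version
        std::unique_ptr<DockerClientpp::DockerClient> dc(new DockerClientpp::DockerClient());
        try {
            dc->stopContainer(s);
            std::cout << "[slave]forceStop(): Container will be force-stopped" << std::endl;
        } catch(std::exception &e) {
            // container has already been destroyed
            std::cout << "[slave]forceStop(): Error => " << e.what() << std::endl;
        }
    }).detach();
}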
Mike van Dyke
  • That's true, I didn't realize it. But that's not the problem; I get a segfault while I'm still in the method, right after calling `detach()`. I tried capturing the pointers by value and the problem is the same – fedemengo Jul 21 '19 at 15:21
  • @fedemengo does it still happen, when you `join` instead of `detach`? Have you tried another compiler? – Mike van Dyke Jul 21 '19 at 20:39
  • Well your question still only seems to show the code that is objectively broken as per this answer. Perhaps you could edit in the fixed code that demonstrates the same issue? – Useless Jul 24 '19 at 08:53
  • @Useless Your comment is addressed to `fedemengo`, isn't it? – Mike van Dyke Jul 24 '19 at 10:57
  • Indeed it is. I'll comment on the question since @ doesn't work in comment threads ... – Useless Jul 24 '19 at 13:07
  • @MikevanDyke it only happens when detaching the thread, joining works just fine – fedemengo Jul 27 '19 at 19:01
2

The segfault suggests, to me, that the object is going out of scope, even though you expect it not to. Another possibility is that you're getting a race condition on the variables you are accessing.

Rather than capturing this in the lambda, try passing all variables by copy to the lambda. This will remove any race conditions having to do with scope and solve any potential lifetime issues, as the lambda will be completely decoupled from any other threads. Of course, this means no pointers or references to data elsewhere; make sure you are really making a full copy of timeOut and activeId.
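
A minimal sketch of that idea, assuming the same timeOut and activeId fields as in the question: a lambda that captures nothing, with the values copied into the new thread by the std::thread constructor:

void ContainerManager::execute() {
    // Non-capturing lambda: everything it needs arrives as a parameter,
    // so the thread holds its own copies and never touches `this`.
    std::thread([](int t, std::string id) {
        std::this_thread::sleep_for(std::chrono::seconds(t));
        std::unique_ptr<DockerClientpp::DockerClient> dc(new DockerClientpp::DockerClient());
        try {
            dc->stopContainer(id);
        } catch(std::exception &e) {
            // container has already been destroyed
            std::cout << "[slave]forceStop(): Error => " << e.what() << std::endl;
        }
    }, timeOut, activeId).detach();   // both arguments are copied into the thread
}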

Alternatively, rather than detaching, I would recommend storing the thread as a data member of the class and then joining it in the destructor. If the thread has already finished, the join will return essentially immediately. If the thread is not finished, the join prevents the resources the thread is using from going out of scope until the thread is done. This addresses variables going out of scope, but not race conditions; those can be solved by using std::atomic or mutexes.
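
A rough sketch of that alternative (the killer member name is hypothetical, error handling trimmed, and it assumes only one outstanding killer thread at a time):

class ContainerManager {
public:
    ~ContainerManager() {
        // Wait for the killer thread before destroying the members it uses.
        if (killer.joinable())
            killer.join();
    }

    void execute() {
        // Join any previous killer thread before starting a new one,
        // since assigning to a joinable std::thread would terminate.
        if (killer.joinable())
            killer.join();
        killer = std::thread([this]() {
            std::this_thread::sleep_for(std::chrono::seconds(timeOut));
            // ... stop the container as in the question ...
        });
    }

private:
    std::thread killer;   // hypothetical member name
    int timeOut = 10;
    std::string activeId;
};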

Since the second solution (using join, std::atomic, and/or mutexes) is more convoluted and requires checking lifetimes and race conditions, I would recommend the first solution (using a lambda that doesn't capture anything, with all arguments passed by copy) if possible.

Joel
  • I tried copying just the variables but the problem remained; it only occurs when I try to **detach** the thread. I'm going to go with storing the thread and then joining it in the destructor, I really like this solution – fedemengo Jul 27 '19 at 19:00
  • If you have timing info, you could see if the issue really is the detach, or the thread. If you are sleeping for 10s, and everything is passed by copy, I expect you can see that the crash will come either right away (and the detach is causing the crash) or later (and the thread is causing a crash after sleeping). Another alternative is that you're not sleeping at all (t = 0) and the crash happens due to something in stopContainer (perhaps by violating its contract or bad library code). You could also remove all of the thread's code to make it simpler. Lots of options. – Joel Jul 27 '19 at 19:20
  • It comes right away. If you look at my first edit, not even the first line of the thread's code is being run. The moment I call detach I get a segfault, and in gdb the thread seems to be null – fedemengo Jul 27 '19 at 19:24
  • I'm just a little skeptical. A segfault's causes usually have to do with invalidly accessing memory. The detach itself has nothing to do with that, so I'd be very reluctant to point the finger at detach; it's more likely a consequence of the detach. I'd try turning the thread into a no-op first, to see if that still segfaults, and add lines of code back. In your first example, you never showed the console output, so I can't tell that the detach itself is the issue. And print statements aren't conclusive either: if it's a race to either segfault or flush, it's possible the segfault is always first. – Joel Jul 27 '19 at 19:30
  • Also, I don't trust gdb if the code was compiled with any optimizations. Anything could be happening then. Typically, parallel code will behave differently with and without optimizations, because that changes timings, which changes sequences, which can change behavior. This is what makes debugging parallel code hard for me: what appears to work in debug might fail in release. And then I can't trust the debugger in release, so I have to fall back to other arcane methods. – Joel Jul 27 '19 at 19:35
  • Also, nobody seems to have said this but it bears repeating. You can't just access memory across threads whenever you want. So once you start that thread, you either will need to protect memory accesses on variables somehow (atomic/mutex) or only let one thread access them. Remember, access means either read or write. Since your code doesn't appear to protect the memory, you're never going to use those data members again. In that case, moving or copying the memory to the thread makes more sense than capturing the memory in a lambda. – Joel Jul 27 '19 at 19:43
  • I tried capturing the single variables and now it's working. I'm 100% sure I already tried it, but apparently I was missing something. I guess `detach` is not the problem – fedemengo Jul 28 '19 at 06:32