4

I have programmed a boost::thread application, where I might have some race conditions based on valgrind/helgrind report. I want to identify the reason of these races.

The program is:

#include <boost/thread.hpp>

boost::mutex myMutex;
boost::condition_variable myConditionalVariable;
bool functionWasRun = false;

void function()
{
    {
        boost::lock_guard<boost::mutex> lock(myMutex);
        functionWasRun = true;
    }
    myConditionalVariable.notify_one();

    //doSomething1();
}

int main()
{
    boost::thread pThread(function);

    //Wait for the thread to start
    boost::unique_lock<boost::mutex> lock(myMutex);
    while (!functionWasRun)
        myConditionalVariable.wait(lock);

    //doSomething2();

    pThread.join();
}

For this simple program valgrind/helgrind is reporting the following errors:

==10840== Helgrind, a thread error detector
==10840== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al.
==10840== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==10840== Command: ./boost_network_test
==10840== 
==10840== ---Thread-Announcement------------------------------------------
==10840== 
==10840== Thread #1 is the program's root thread
==10840== 
==10840== ---Thread-Announcement------------------------------------------
==10840== 
==10840== Thread #2 was created
==10840==    at 0x6570EBE: clone (clone.S:74)
==10840==    by 0x4E44199: do_clone.constprop.3 (createthread.c:75)
==10840==    by 0x4E458BA: pthread_create@@GLIBC_2.2.5 (createthread.c:245)
==10840==    by 0x4C30C90: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==10840==    by 0x547B3B8: boost::thread::start_thread_noexcept() (in /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0)
==10840==    by 0x53CFCC: boost::thread::start_thread() (thread.hpp:180)
==10840==    by 0x53D31B: boost::thread::thread<void (&)()>(void (&)()) (thread.hpp:267)
==10840==    by 0x53CA7B: main (main_test.cpp:20)
==10840== 
==10840== ----------------------------------------------------------------
==10840== 
==10840== Possible data race during read of size 8 at 0x8A21E0 by thread #1
==10840== Locks held: none
==10840==    at 0x432CEB: boost::mutex::lock() (mutex.hpp:113)
==10840==    by 0x43D197: boost::unique_lock<boost::mutex>::lock() (lock_types.hpp:346)
==10840==    by 0x43C1A0: boost::unique_lock<boost::mutex>::unique_lock(boost::mutex&) (lock_types.hpp:124)
==10840==    by 0x53CA9E: main (main_test.cpp:23)
==10840== 
==10840== This conflicts with a previous write of size 8 by thread #2
==10840== Locks held: none
==10840==    at 0x432CF6: boost::mutex::lock() (mutex.hpp:113)
==10840==    by 0x43BFE9: boost::lock_guard<boost::mutex>::lock_guard(boost::mutex&) (lock_guard.hpp:38)
==10840==    by 0x53C9DE: function() (main_test.cpp:10)
==10840==    by 0x53DAAA: boost::detail::thread_data<void (*)()>::run() (thread.hpp:117)
==10840==    by 0x547BA49: ??? (in /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0)
==10840==    by 0x4C30E26: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==10840==    by 0x4E45181: start_thread (pthread_create.c:312)
==10840==    by 0x6570EFC: clone (clone.S:111)
==10840== 
==10840== ----------------------------------------------------------------
==10840== 
==10840== Possible data race during write of size 8 at 0x8A21E0 by thread #1
==10840== Locks held: none
==10840==    at 0x432CF6: boost::mutex::lock() (mutex.hpp:113)
==10840==    by 0x43D197: boost::unique_lock<boost::mutex>::lock() (lock_types.hpp:346)
==10840==    by 0x43C1A0: boost::unique_lock<boost::mutex>::unique_lock(boost::mutex&) (lock_types.hpp:124)
==10840==    by 0x53CA9E: main (main_test.cpp:23)
==10840== 
==10840== This conflicts with a previous write of size 8 by thread #2
==10840== Locks held: none
==10840==    at 0x432CF6: boost::mutex::lock() (mutex.hpp:113)
==10840==    by 0x43BFE9: boost::lock_guard<boost::mutex>::lock_guard(boost::mutex&) (lock_guard.hpp:38)
==10840==    by 0x53C9DE: function() (main_test.cpp:10)
==10840==    by 0x53DAAA: boost::detail::thread_data<void (*)()>::run() (thread.hpp:117)
==10840==    by 0x547BA49: ??? (in /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.54.0)
==10840==    by 0x4C30E26: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==10840==    by 0x4E45181: start_thread (pthread_create.c:312)
==10840==    by 0x6570EFC: clone (clone.S:111)
==10840== 
==10840== 
==10840== For counts of detected and suppressed errors, rerun with: -v
==10840== Use --history-level=approx or =none to gain increased speed, at
==10840== the cost of reduced accuracy of conflicting-access information
==10840== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 15 from 15)

Please could you help me to identify how can I fix this program?

This program is run on Ubuntu 14.04LTS, x86_64, gcc ver. 4.8.2, libboost v.1.54, valgrind 3.10.0. My original program is a complex application using ActiveObject class which has been stripped down to a bare minimum to see the race condition pattern.

This thread may seem similar to Debug boost::thread application, high false positive rate but is involves a conditional variable/mutex.

Community
  • 1
  • 1
radekg1000
  • 71
  • 7

3 Answers3

3

The good news is that your program is correct. All the races being reported by valgrind are false positives (although do consider unlocking the mutex before you join the thread - it could save you a headache later).

I have OSX10.10 so had to build valgrind from latest trunk in order to try this.

Here's what it gave me (truncated output below)

I rewrote the program with std::thread/std::mutex etc and had similar results.

truncated output:

==79537== Helgrind, a thread error detector
==79537== Copyright (C) 2007-2013, and GNU GPL'd, by OpenWorks LLP et al.
==79537== Using Valgrind-3.11.0.SVN and LibVEX; rerun with -h for copyright info
==79537== Command: Build/Products/Debug/cpprules
==79537== 
--79537-- Build/Products/Debug/cpprules:
--79537-- dSYM directory is missing; consider using --dsymutil=yes
--79537-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option
--79537-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 2 times)
--79537-- UNKNOWN mach_msg unhandled MACH_SEND_TRAILER option (repeated 4 times)
==79537== ---Thread-Announcement------------------------------------------
==79537== 
==79537== Thread #2 was created
==79537==    at 0x10048D7EE: __bsdthread_create (in /usr/lib/system/libsystem_kernel.dylib)
==79537==    by 0x10003FC75: boost::thread::start_thread_noexcept() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000134E: main (in Build/Products/Debug/cpprules)
==79537== 
==79537== ---Thread-Announcement------------------------------------------
==79537== 
==79537== Thread #1 is the program's root thread
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during read of size 4 at 0x1005AA2D8 by thread #2
==79537== Locks held: none
==79537==    at 0x10058FFCF: spin_lock (in /usr/lib/system/libsystem_platform.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 4 by thread #1
==79537== Locks held: none
==79537==    at 0x10058FFDF: spin_unlock (in /usr/lib/system/libsystem_platform.dylib)
==79537==    by 0x10003FC75: boost::thread::start_thread_noexcept() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000134E: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x1005aa2d8 is in the Data segment of /usr/lib/system/libsystem_pthread.dylib
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during read of size 4 at 0x103DEB010 by thread #2
==79537== Locks held: none
==79537==    at 0x1005A3293: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 4 by thread #1
==79537== Locks held: none
==79537==    at 0x1005A30D8: pthread_create (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x10003FC75: boost::thread::start_thread_noexcept() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000134E: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x103deb010 is in a rw- mapped file segment
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during write of size 4 at 0x103DEB010 by thread #2
==79537== Locks held: none
==79537==    at 0x1005A329B: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 4 by thread #1
==79537== Locks held: none
==79537==    at 0x1005A30D8: pthread_create (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x10003FC75: boost::thread::start_thread_noexcept() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000134E: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x103deb010 is in a rw- mapped file segment
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during write of size 4 at 0x1005AA2D8 by thread #2
==79537== Locks held: none
==79537==    at 0x10058FFDF: spin_unlock (in /usr/lib/system/libsystem_platform.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 4 by thread #1
==79537== Locks held: none
==79537==    at 0x10058FFDF: spin_unlock (in /usr/lib/system/libsystem_platform.dylib)
==79537==    by 0x10003FC75: boost::thread::start_thread_noexcept() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000134E: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x1005aa2d8 is in the Data segment of /usr/lib/system/libsystem_pthread.dylib
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during read of size 4 at 0x10004D118 by thread #2
==79537== Locks held: none
==79537==    at 0x100048451: boost::thread_detail::enter_once_region(boost::once_flag&) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10003FA54: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10003FD6C: boost::(anonymous namespace)::thread_proxy(void*) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x1005A32FB: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 4 by thread #1
==79537== Locks held: none
==79537==    at 0x10004854A: boost::thread_detail::commit_once_region(boost::once_flag&) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10003F9F4: boost::detail::get_current_thread_data() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000139F: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x10004d118 is in the Data segment of /usr/local/lib/libboost_thread.dylib
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during read of size 8 at 0x10004D120 by thread #2
==79537== Locks held: none
==79537==    at 0x10003FA78: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10003FD6C: boost::(anonymous namespace)::thread_proxy(void*) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x1005A32FB: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 8 by thread #1
==79537== Locks held: none
==79537==    at 0x1005A1618: pthread_key_create (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x10003F9E8: boost::detail::get_current_thread_data() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000139F: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x10004d120 is in the Data segment of /usr/local/lib/libboost_thread.dylib
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during read of size 8 at 0x1005AC448 by thread #2
==79537== Locks held: none
==79537==    at 0x1005A269E: pthread_setspecific (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A32FB: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 8 by thread #1
==79537== Locks held: none
==79537==    at 0x1005A1614: pthread_key_create (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x10003F9E8: boost::detail::get_current_thread_data() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000139F: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x1005ac448 is in the Data segment of /usr/lib/system/libsystem_pthread.dylib
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during write of size 1 at 0x100009D48 by thread #2
==79537== Locks held: none
==79537==    at 0x100000F49: function() (in Build/Products/Debug/cpprules)
==79537==    by 0x10003FD79: boost::(anonymous namespace)::thread_proxy(void*) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x1005A32FB: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous read of size 1 by thread #1
==79537== Locks held: none
==79537==    at 0x10000138D: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x100009d48 is in the Data segment of Build/Products/Debug/cpprules
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during write of size 8 at 0x100009D28 by thread #2
==79537== Locks held: none
==79537==    at 0x1005A4099: _pthread_cond_updateval (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A294D: _pthread_cond_signal (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x100000F77: function() (in Build/Products/Debug/cpprules)
==79537==    by 0x10003FD79: boost::(anonymous namespace)::thread_proxy(void*) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x1005A32FB: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 8 by thread #1
==79537== Locks held: none
==79537==    at 0x1005A3DE1: _pthread_cond_wait (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1000014AA: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x100009d28 is in the Data segment of Build/Products/Debug/cpprules
==79537== 
==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during read of size 8 at 0x1005AC448 by thread #2
==79537== Locks held: none
==79537==    at 0x1005A4791: _pthread_tsd_cleanup (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A44E4: _pthread_exit (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3306: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537== 
==79537== This conflicts with a previous write of size 8 by thread #1
==79537== Locks held: none
==79537==    at 0x1005A1614: pthread_key_create (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x10003F9E8: boost::detail::get_current_thread_data() (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x10000139F: main (in Build/Products/Debug/cpprules)
==79537==  Address 0x1005ac448 is in the Data segment of /usr/lib/system/libsystem_pthread.dylib
==79537== 

...

==79537== ----------------------------------------------------------------
==79537== 
==79537== Possible data race during write of size 8 at 0x100009C98 by thread #1
==79537== Locks held: none
==79537==    at 0x1005A37DD: pthread_mutex_destroy (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x100005D50: boost::mutex::~mutex() (in Build/Products/Debug/cpprules)
==79537==    by 0x1003A38EC: __cxa_finalize_ranges (in /usr/lib/system/libsystem_c.dylib)
==79537==    by 0x1003A3BEF: exit (in /usr/lib/system/libsystem_c.dylib)
==79537==    by 0x1002F05CF: start (in /usr/lib/system/libdyld.dylib)
==79537== 
==79537== This conflicts with a previous read of size 8 by thread #2
==79537== Locks held: none
==79537==    at 0x1005A1655: _pthread_mutex_lock (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x100000EBB: function() (in Build/Products/Debug/cpprules)
==79537==    by 0x10003FD79: boost::(anonymous namespace)::thread_proxy(void*) (in /usr/local/lib/libboost_thread.dylib)
==79537==    by 0x1005A32FB: _pthread_body (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A3278: _pthread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==    by 0x1005A14B0: thread_start (in /usr/lib/system/libsystem_pthread.dylib)
==79537==  Address 0x100009c98 is in the Data segment of Build/Products/Debug/cpprules
==79537== 
==79537== 
==79537== For counts of detected and suppressed errors, rerun with: -v
==79537== Use --history-level=approx or =none to gain increased speed, at
==79537== the cost of reduced accuracy of conflicting-access information
==79537== ERROR SUMMARY: 34 errors from 33 contexts (suppressed: 0 from 0)
Richard Hodges
  • 68,278
  • 7
  • 90
  • 142
  • Your output shows that the problem is not just bound to my PC and is reproducible. I have translated my code to POSIX Threads as in the post above. Please could you check if my translation compiles on Mac OS and does not produce errors either? – radekg1000 Dec 18 '14 at 12:53
  • will do when I get home – Richard Hodges Dec 18 '14 at 14:01
  • Richard, thank you for your help. I have put the final answer above (the fprofile-arcs thing). I wonder what where the flags that you used so that the problem was reported in your case? – radekg1000 Dec 18 '14 at 15:14
  • I'll need to check. happy to have been part of finding a solution. I've learned something today :) – Richard Hodges Dec 18 '14 at 16:09
2

As someone said there was something strange happening behind the scenes and now I found the reason.

There is nothing strange in the code in both the boost and the POSIX variant. The gist is that the code was compiled in the testing framework with -fprofile-arcs flag which I use for code coverage (lcov). This flag is instrumenting the program and changing its behaviour. In my case any object that appears in the program (e.g. boost::mutex) becomes instrumented and as such interferes with helgrind output. When the program was converted into POSIX Threads all objects disappeared - the interference too, so this would explain why the POSIX variant worked.

Please try to remember: HELGRIND = NO -fprofile-arcs FLAG

radekg1000
  • 71
  • 7
-2

Could it be the race condition on the myConditionalVariable?

If its the case, below will solve the problem.

void function()
{
    {
        boost::lock_guard<boost::mutex> lock(myMutex);
        functionWasRun = true;
        myConditionalVariable.notify_one();
    }

    //doSomething1();
}

int main()
{
    boost::thread pThread(function);

    //Wait for the thread to start
    while (!functionWasRun)
    {
        boost::unique_lock<boost::mutex> lock(myMutex);
        myConditionalVariable.wait(lock);
    }

    //doSomething2();

    pThread.join();
}
9dan
  • 4,222
  • 2
  • 29
  • 44
  • It's a logic error to notify the condition variable while the lock is held. – Richard Hodges Dec 18 '14 at 10:21
  • Sorry, it did not help. Now I get a different complaint from helgrind: Possible data race during read of size 1 at ... by thread #1 Locks held: none (at ...: main_test.cpp:23), This conflicts with a previous write of size 1 by thread #2 Locks held: 1, at address ... at ...: function() (main_test.cpp:11) – radekg1000 Dec 18 '14 at 10:55
  • My example follows the example from boost library documentation http://www.boost.org/doc/libs/1_54_0/doc/html/thread/synchronization.html#thread.synchronization.condvar_ref about how to use conditional variables (the same mutex, lock and condition variable types). – radekg1000 Dec 18 '14 at 10:58
  • 1
    It is potentially slower to notify a condition variable while holding the lock, but it is still valid to do so. The modified code above reads `functionWasRun` without the lock, which is a very real data race. – Jonathan Wakely Dec 18 '14 at 11:40
  • 1
    in `condition_variable::wait(lock, pred)` the predicate is run while the mutex is locked. The modified code was exactly equivalent to the OP's program. – Richard Hodges Dec 18 '14 at 12:13
  • @JonathanWakely we could, if there is only one thread notifying. I'd always lock first for the sake of code maintenance. – 9dan Dec 19 '14 at 02:02
  • I'd worry more about reading shared variables outside locks than about signalling condition variables outside locks. And the question isn't whether you lock first, as in almost all cases you have to lock anyway, it's whether you notify before or after _un_-locking. – Jonathan Wakely Dec 19 '14 at 10:01