Why does my program get stuck under Helgrind but not when run normally?
Helgrind is a happens-before-based data race detection tool. It uses the Valgrind instrumentation engine to add monitoring code to the program as it executes.
However, the Valgrind engine is not thread safe: its internal data structures (the instrumentation engine and the instrumented code itself) are not designed for multiple threads to run concurrently. So even though the user program may create multiple threads (otherwise, Helgrind would be useless), Valgrind is only able to run one of them at a time.
Valgrind ensures that only one thread is running at a time using a lock, more specifically, a regular Linux pthread_mutex_t mutex. That is, each thread must hold this single lock while executing code. Periodically, the running thread drops the lock to give other threads a chance to execute for a bit, then reacquires it to continue.
This is where the problem arises: dropping a lock and then reacquiring it does not guarantee that the other threads waiting on that lock get it first. Because Linux's mutexes are "unfair", the thread that just released the lock may immediately win it back.
In practice, the same thread often does win the lock back, typically when it never moves into the BLOCKED state (that is, it is busy waiting on something, or for other reasons running in an infinite loop). In that scenario, Valgrind's default mechanism fails: the other threads never get the lock, so they never get to run.
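To make the failure scenario concrete, here is a minimal sketch of a thread that never blocks. The names spin_until_ready, setter, ready, and result are made up for this illustration and do not come from any course code. Run normally, the setter thread gets its own CPU time and the loop ends quickly. Under Valgrind's default scheduling, the spinning thread may keep winning the lock back, so the setter may never run and the program can appear stuck.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int ready = 0;  /* flag polled by the spinning thread */
static int result = 0;        /* written before the flag is set */

/* Hypothetical helper thread: publishes a value, then raises the flag. */
static void *setter(void *arg) {
    (void)arg;
    result = 42;
    atomic_store(&ready, 1);
    return NULL;
}

/* Spins until the helper thread raises the flag. The calling thread
   never enters a blocking call inside the loop, so it never moves
   into the BLOCKED state. */
int spin_until_ready(void) {
    pthread_t t;
    atomic_store(&ready, 0);
    int rc = pthread_create(&t, NULL, setter, NULL);
    assert(rc == 0);
    while (atomic_load(&ready) == 0) {
        /* busy wait: burns CPU instead of blocking */
    }
    pthread_join(t, NULL);
    return result;
}
```

To try it, call spin_until_ready() from main and compile with gcc -pthread. Whether a given run under Helgrind actually stalls depends on timing, so treat this as an illustration of the pattern rather than a guaranteed reproducer.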
Fortunately, Valgrind added a facility that overcomes the unfairness of Linux's mutexes and forces the threads to actually take turns. It is enabled with the --fair-sched=yes option:
valgrind --tool=helgrind --fair-sched=yes ...
You will need to use this flag to debug code in which threads do not give up the CPU, such as in one stage of the current version of ex3.
Conversely, if you find yourself needing this flag in p2, you know you are doing something wrong, since the worker threads should not normally be busy waiting.
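For contrast, here is a sketch of the blocking alternative to busy waiting, using a mutex and a condition variable. The names wait_until_ready, producer, ready_lock, ready_cond, is_ready, and value are hypothetical and are not taken from p2. Because pthread_cond_wait puts the waiting thread to sleep, that thread moves into the BLOCKED state and stops competing for Valgrind's lock, so the other thread gets to run without any special scheduling.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t ready_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready_cond = PTHREAD_COND_INITIALIZER;
static bool is_ready = false;  /* protected by ready_lock */
static int  value = 0;         /* protected by ready_lock */

/* Hypothetical producer: publishes a value and wakes the waiter. */
static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&ready_lock);
    value = 42;
    is_ready = true;
    pthread_cond_signal(&ready_cond);
    pthread_mutex_unlock(&ready_lock);
    return NULL;
}

/* Blocks (rather than spins) until the producer has published. */
int wait_until_ready(void) {
    pthread_t t;

    pthread_mutex_lock(&ready_lock);
    is_ready = false;              /* reset so the function can be reused */
    pthread_mutex_unlock(&ready_lock);

    int rc = pthread_create(&t, NULL, producer, NULL);
    assert(rc == 0);

    pthread_mutex_lock(&ready_lock);
    while (!is_ready)              /* loop guards against spurious wakeups */
        pthread_cond_wait(&ready_cond, &ready_lock);
    int v = value;
    pthread_mutex_unlock(&ready_lock);

    pthread_join(t, NULL);
    return v;
}
```

The while loop around pthread_cond_wait is the standard idiom: the predicate is rechecked after every wakeup, and the wait itself releases the mutex while the thread sleeps.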
Side Note: one of the amazing properties of happens-before-based data race detection algorithms is that they can detect data races even under an execution regime in which the data race will not actually manifest, as is the case with the serialized execution inside Valgrind.
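As an illustration of that property, the sketch below contrasts an unsynchronized counter with a mutex-protected one. All names (count_racy, count_locked, run_two, and so on) are invented for this example. In add_racy, no lock or other synchronization orders the two threads' accesses to counter, so there is no happens-before relationship between them. A happens-before detector such as Helgrind would be expected to flag those accesses even in a run where the threads execute one after another and the final count comes out right.

```c
#include <assert.h>
#include <pthread.h>

static long counter = 0;
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Unsynchronized increments: a data race, even if a particular
   serialized run happens to produce the expected total. */
static void *add_racy(void *arg) {
    long n = *(long *)arg;
    for (long i = 0; i < n; i++)
        counter++;
    return NULL;
}

/* Same work, but each increment is ordered by the mutex. */
static void *add_locked(void *arg) {
    long n = *(long *)arg;
    for (long i = 0; i < n; i++) {
        pthread_mutex_lock(&counter_lock);
        counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs two threads executing fn and returns the final counter value. */
static long run_two(void *(*fn)(void *), long n) {
    pthread_t a, b;
    counter = 0;
    int rc = pthread_create(&a, NULL, fn, &n);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, fn, &n);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;  /* both threads joined, so this read is ordered */
}

long count_racy(long n)   { return run_two(add_racy, n); }
long count_locked(long n) { return run_two(add_locked, n); }
```

Calling count_racy from main under Helgrind should show the kind of report described above, while count_locked should run clean. The exact wording of the report depends on the Valgrind version.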