Why does my program get stuck under Helgrind but not when run normally?

Helgrind is a happens-before-based data race detector. It uses the Valgrind instrumentation engine to add monitoring code to the program as it executes.

However, the Valgrind engine is not thread-safe: it does not support multiple threads running concurrently. Even though the user program may create multiple threads (otherwise, Helgrind would be useless), Valgrind can run only one of them at a time, because its internal data structures (the instrumentation engine and the instrumented code itself) are not designed for concurrent access.

Valgrind ensures that only one thread runs at a time by using a single lock, which by default is implemented on top of a pipe. Each thread must hold this lock while executing code. Periodically the running thread drops the lock to give other threads a chance to execute for a bit, and then reacquires it to continue.

This is where the problem arises: dropping a lock and then reacquiring it does not guarantee that another thread waiting on that lock gets it first, because this lock is not fair.
In practice, the thread that just released the lock often wins it back immediately. This matters most when that thread never moves into the BLOCKED state (that is, it is busy waiting on something, or is otherwise running in an infinite loop). In that scenario, Valgrind's default mechanism fails: the other threads may never get the lock, so they never get to run.

Fortunately, Valgrind provides a fair locking mechanism (based on futexes on Linux) that overcomes this unfairness and forces the threads to actually take turns. It is enabled by adding the --fair-sched=yes flag:

valgrind --tool=helgrind --fair-sched=yes ....

You will need this flag to debug code in which threads do not give up the CPU, such as in one stage of the current version of ex3. Conversely, if you find yourself needing this flag in p2, you know you are doing something wrong, since the worker threads should not normally be busy waiting.

Side Note: one of the amazing properties of happens-before-based data race detection algorithms is that they can detect a data race even in an execution in which the race does not actually manifest, as is the case under Valgrind's serialized execution.