While I can't help (yet!) with most of your issues, I think our C++ Test Coverage tool could provide you with multithreaded test coverage data pretty easily.
This tool instruments your source code; you compile and run that. You end up with (cheap)
instrumentation probes in your code representing various blocks. The instrumentation
records which parts of your program execute, nominally as a bit vector with one
bit per instrumentation probe. At the end of execution (or whenever you like), this
bit vector is dumped out and a viewer will show it to you superimposed on the code.
The trick to getting multithreaded test coverage is knowing that we give you complete
control over how the instrumentation probes work; they are macros. So rather than
using the default macro of essentially
probe[n]=true;
on a boolean array, you can instead implement
probe[n]|=1<<threadid;
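on an int array (or something cleverly cheaper, such as precomputing 1<<threadid once per thread).
Here threadid has to be a small per-thread index, less than the number of bits in an int.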
This likely takes only a few lines of code to implement.
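As a rough illustration of such a probe, here is a minimal sketch. It is not the tool's actual macro: the names PROBE, PROBE_COUNT, probe, register_thread and my_thread_mask are hypothetical, and it assumes at most 32 threads, each given a small index by your own code. The mask is precomputed once per thread so each probe is a single OR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sizes and names, chosen for illustration only.
constexpr std::size_t PROBE_COUNT = 1024;  // assumed number of probes
constexpr unsigned MAX_THREADS = 32;       // one bit per thread in a 32-bit slot

// One slot per probe; bit t is set once thread t has executed that probe.
// Namespace-scope arrays are zero-initialized.
std::uint32_t probe[PROBE_COUNT];

// Precomputed (1 << thread index) for the current thread.
thread_local std::uint32_t my_thread_mask = 0;

// Call once at the start of each thread, with an index your code assigns.
inline void register_thread(unsigned thread_index) {
    assert(thread_index < MAX_THREADS);
    my_thread_mask = std::uint32_t{1} << thread_index;
}

// Deliberately unsynchronized: two threads hitting the same probe at the same
// moment may lose a bit (formally a data race in C++).
#define PROBE(n) (probe[(n)] |= my_thread_mask)
```

Each thread would call register_thread once before running instrumented code; after that, PROBE(n) costs one load, one OR and one store.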
Folks might note this technically has synchronization troubles: the |= is an
unsynchronized read-modify-write, so two threads updating the same probe at the
same moment can overwrite each other. That's true, but at worst it loses a bit
of coverage data, and the odds of that happening are pretty low. Most people
are happy with "pretty good" data rather than perfect. If you insist
on perfection, you'll pay a high synchronization price using some
atomic update instruction.
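For those who do want exact data, the atomic update mentioned above could look roughly like the following sketch. It uses the standard std::atomic fetch_or; the names atomic_probe, ATOMIC_PROBE_COUNT and hit_probe_atomic are hypothetical and not part of the tool. Checking the bit first is one way to avoid paying for the atomic read-modify-write on every execution.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical probe storage; static storage starts out as all zeros.
constexpr std::size_t ATOMIC_PROBE_COUNT = 1024;
std::atomic<std::uint32_t> atomic_probe[ATOMIC_PROBE_COUNT];

// Record that the thread owning `thread_mask` executed probe `n`.
inline void hit_probe_atomic(std::size_t n, std::uint32_t thread_mask) {
    assert(n < ATOMIC_PROBE_COUNT);
    // Cheap relaxed load first: once the bit is set, later hits skip the
    // more expensive atomic read-modify-write entirely.
    if ((atomic_probe[n].load(std::memory_order_relaxed) & thread_mask) == 0) {
        // Atomic OR: concurrent updates from other threads are never lost.
        atomic_probe[n].fetch_or(thread_mask, std::memory_order_relaxed);
    }
}
```

Relaxed ordering is enough here as long as the probe array is only read after the worker threads have been joined.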
We also give you control over the probe dumping logic; you can revise it to write out
thread-specific coverage data (on the order of tens of lines of custom code).
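To give a feel for the size of that custom code, here is a sketch of splitting the per-probe thread masks into one coverage vector per thread. The helper name format_thread_vector and the '0'/'1' text output are assumptions made for illustration; the tool's real dump format is different and you would adapt the dumping code it ships with.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: render one thread's coverage vector as a string of
// '0'/'1' characters, one character per probe.
inline std::string format_thread_vector(const std::uint32_t* probes,
                                        std::size_t count,
                                        unsigned thread_index) {
    assert(thread_index < 32);  // one bit per thread in a 32-bit slot
    const std::uint32_t mask = std::uint32_t{1} << thread_index;
    std::string out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // A probe counts as covered by this thread if its bit is set.
        out.push_back((probes[i] & mask) ? '1' : '0');
    }
    return out;
}
```

Calling this once per thread index at dump time would yield one vector per thread, each of which could be written to its own file.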
The test coverage data viewer will then let you see thread-specific coverage
(just choose the right coverage vector); it also has a built-in facility for
easily computing and displaying the intersection, union, and difference
of coverage vectors, which gives you exactly your relation of coverage per thread.
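The viewer does these set operations for you, but to make them concrete, here is a small sketch of what they mean on per-thread coverage vectors. The CoverageVector alias and the function names are hypothetical and only illustrate the idea; they are not the viewer's API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical representation: element i is true if probe i was executed.
using CoverageVector = std::vector<bool>;

// Apply a boolean operation probe by probe to two equally sized vectors.
template <typename Op>
CoverageVector combine(const CoverageVector& a, const CoverageVector& b, Op op) {
    assert(a.size() == b.size());
    CoverageVector out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = op(a[i], b[i]);
    }
    return out;
}

// Probes executed by either thread.
inline CoverageVector coverage_union(const CoverageVector& a, const CoverageVector& b) {
    return combine(a, b, [](bool x, bool y) { return x || y; });
}

// Probes executed by both threads.
inline CoverageVector coverage_intersection(const CoverageVector& a, const CoverageVector& b) {
    return combine(a, b, [](bool x, bool y) { return x && y; });
}

// Probes executed by thread a but not by thread b.
inline CoverageVector coverage_diff(const CoverageVector& a, const CoverageVector& b) {
    return combine(a, b, [](bool x, bool y) { return x && !y; });
}
```

For example, the difference of thread 1's vector and thread 2's vector shows the code that only thread 1 reached, while the intersection shows code both threads executed.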