How to debug parallelized stochastic software?

Question

I am looking for concrete advice for dealing with a rather high-level problem: how to debug software (a Genetic Algorithm in case you are interested) that:

Runs tasks across multiple threads (I don't control which thread runs which task)
Each task's execution depends upon random values (I don't control the randomization seed)
A task's state is a complex graph of objects which cannot be easily serialized to a flat human-readable format

So far, I've tried the following:

Examining individual threads in a debugger: This is problematic because most tasks complete successfully (setting breakpoints in advance of a problem leads to many false positives). On the flip side, if I set a breakpoint that stops once a task is in a bad state, I cannot step back in time to figure out how I ended up there.
Dumping trace logs: This is great in theory (I can step back in time once I spot a bad state), but I haven't yet figured out how to serialize a task's state to a flat human-readable format.

In an ideal world, I would set a breakpoint on a bad state and then step back in time in the debugger to examine how I got there.

Have you run into this kind of problem before? How did you debug it?

score 0 · Answer 1 · answered Aug 03 '15 at 20:50

I have done the following:

Assign each task an immutable TaskID that gets included in the data structure that represents the task.
At the beginning of a task, log the TaskID and the complete state of the task's data structure.
At each significant execution step, write a single line log entry that includes the TaskID and any significant changes to the task's data structure.
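
As a rough sketch of the three steps above in Python (the question doesn't say which language the GA is written in, so the Task class, the bit-flip "mutation", the T0001-style IDs, and the key=value log format are all made up for illustration, not taken from any real code):

```python
import itertools
import logging
import random
import sys
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

_ids = itertools.count(1)  # source of TaskIDs (hypothetical scheme)


@dataclass
class Task:
    """Hypothetical task: a bit-list genome plus an ID assigned once at creation."""
    genome: list
    task_id: str = field(default_factory=lambda: f"T{next(_ids):04d}")


def make_logger(stream, name="ga.trace"):
    """Return a logger that writes one line per record to `stream`."""
    log = logging.getLogger(name)
    log.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(threadName)s %(message)s"))
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.propagate = False
    return log


def run_task(task, log, steps=3):
    """Stand-in for real GA work: flip random bits, logging every change."""
    # Log the TaskID and the complete starting state once.
    log.info("task=%s event=start genome=%s", task.task_id, task.genome)
    for step in range(steps):
        i = random.randrange(len(task.genome))
        old = task.genome[i]
        task.genome[i] = 1 - old
        # One line per significant step, recording only what changed.
        log.info("task=%s event=mutate step=%d index=%d old=%d new=%d",
                 task.task_id, step, i, old, task.genome[i])
    log.info("task=%s event=done genome=%s", task.task_id, task.genome)
    return task


if __name__ == "__main__":
    log = make_logger(sys.stderr)
    tasks = [Task(genome=[0, 1, 0, 1]) for _ in range(4)]
    with ThreadPoolExecutor(max_workers=2) as pool:
        list(pool.map(lambda t: run_task(t, log), tasks))
```

Because every line carries the TaskID, lines from different threads can interleave freely in the output and still be pulled apart afterwards. Logging only the delta at each step also sidesteps having to serialize the whole object graph every time.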

Then, when you are diagnosing a problem, use a tool to filter the log to only the task you are interested in. The "poor man's" way to do this is to use grep. The "rich man's" way is to use something like Splunk.
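
If grep isn't handy, the same filtering is a few lines of Python. This is only a sketch, and it assumes each log line contains a whitespace-delimited token of the form task=<id>, which is a format I've invented for illustration; adjust the pattern to whatever your log lines actually look like.

```python
import re


def filter_by_task(lines, task_id):
    """Return only the log lines tagged with exactly `task_id`."""
    # (?!\S) requires whitespace or end of line after the ID,
    # so that "task=T1" does not also match "task=T10".
    pattern = re.compile(r"\btask=" + re.escape(task_id) + r"(?!\S)")
    return [line for line in lines if pattern.search(line)]


if __name__ == "__main__":
    sample = [
        "worker_0 task=T1 event=start",
        "worker_1 task=T10 event=start",
        "worker_0 task=T1 event=done",
    ]
    for line in filter_by_task(sample, "T1"):
        print(line)
```

The result is the full history of one task in execution order, which is the "step back in time" view the debugger can't give you.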

Strategies that rely on the debugger alone won't get you far enough: the only history available for diagnosis is whatever you can "afford" to keep on the heap, whereas a log can record every step of every task.
