
I have a function that, when I run it in parallel with different inputs, gives a different output than when I run it sequentially with those same inputs.

Is a race condition the only reason this would happen?

EDIT: I have tried the following - I run 3 versions of the function in parallel, all with the same input. The 3 outputs are exactly the same, but they are still different from the output when I run the code without parallelization. So that means a "race condition" is not the issue, right? Otherwise the 3 results would differ from each other?

Vladimir Belik
  • What does the function do? In Python I would say yes, a race condition is the only cause. – Tayeb HAMDAOUI Jun 08 '22 at 20:57
  • It could be some other bug, but race conditions are the most common type of bug specific to parallel code. – Barmar Jun 08 '22 at 20:57
  • "Race condition" mostly names the symptom, not the cause: it's what you see (different results) rather than what you did (e.g. a missing lock). – PMF Jun 08 '22 at 20:58
  • @TayebHAMDAOUI The function basically runs multiple nested for-loops to test out different hyperparameters for a reinforcement learning model and gets the results (selecting the best parameters). I have the details and code in this post: https://stackoverflow.com/questions/72539251/joblib-package-why-is-parallel-giving-incorrect-output-changing-n-jobs-1?noredirect=1#comment128140669_72539251 – Vladimir Belik Jun 08 '22 at 20:59
  • @Barmar What else could it be? I'd greatly appreciate it if you took a look at my edit. – Vladimir Belik Jun 08 '22 at 21:00
  • @TayebHAMDAOUI I have made a potentially important edit - does that change your assessment of a race condition being the cause? – Vladimir Belik Jun 08 '22 at 21:02
  • @PMF Would you mind taking a look at my edit? That result tells me that maybe a lock isn't what I'm missing. – Vladimir Belik Jun 08 '22 at 22:01
  • I can see that you are running a function with joblib. If you are training the model, sometimes ML packages read from files (in my experience with opencv), and I don't think file descriptors are separate for threads. Did you try another Python library like psutil? – Tayeb HAMDAOUI Jun 08 '22 at 22:42
  • Also, joblib uses Loky (I don't know what that is) as the default backend, not threads. So try this: `with parallel_backend('threading', n_jobs=3): Parallel()(delayed(function_name)(i) for i in input)` (a runnable version is sketched after these comments). See the joblib documentation: https://joblib.readthedocs.io/en/latest/parallel.html – Tayeb HAMDAOUI Jun 08 '22 at 22:45
  • @TayebHAMDAOUI To be honest, I don't really know what you mean by the first comment there about file descriptors, but no, I haven't tried multiprocessing with psutil. Do you recommend it? Also, I couldn't get that exact line of code to work, but instead I did `Parallel(backend="threading", n_jobs=3)` and... I can't believe it. The result was a DIFFERENT one from the `backend="loky"` AND from the sequential one! That's three different results: sequential vs. parallel "loky" vs. parallel "threading". What in the world is going on here? – Vladimir Belik Jun 09 '22 at 04:33
  • Is it possible to share some code? Or is it to complex? – PMF Jun 09 '22 at 04:55
  • @PMF I think it's a bit complex, but if you look above at the link I gave to Tayeb, I have a simplified and a complicated version of my code in that post. If you'd like to take a look, I'd greatly appreciate it! The full code is at the bottom of that post. – Vladimir Belik Jun 09 '22 at 05:00
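
A runnable sketch of the backend switch suggested in the comments above; `function_name` and `inputs` are placeholders standing in for the real hyperparameter-search function and its argument list:

```python
from joblib import Parallel, delayed, parallel_backend

def function_name(x):
    # Placeholder for the actual function being parallelized.
    return x * x

inputs = [1, 2, 3]

# Default backend ("loky") runs the calls in separate processes.
results_loky = Parallel(n_jobs=3)(delayed(function_name)(i) for i in inputs)

# Thread-based backend, as suggested in the comment above.
with parallel_backend('threading', n_jobs=3):
    results_threads = Parallel()(delayed(function_name)(i) for i in inputs)

# Equivalent form, as tried in the follow-up comment.
results_threads_2 = Parallel(backend='threading', n_jobs=3)(
    delayed(function_name)(i) for i in inputs)

print(results_loky, results_threads, results_threads_2)
```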

1 Answer

No, race conditions are not the only possible reason for results that differ from the sequential ones. There are plenty of possible causes. One frequent issue with parallelism is that the order of operations is modified, and not all operations are associative or commutative. For example, floating-point operations are not associative (e.g. `(1 + 1e-16) - 1e-16 != 1 + (1e-16 - 1e-16)` for IEEE-754 64-bit floating-point numbers). This means a parallel sum will generally not give the same result as a sequential one (funny point: the parallel one will often be slightly more accurate, so different does not mean wrong). Matrix multiplications are theoretically associative but not commutative (due to FP non-associativity, they are in fact neither associative nor commutative). Random-based algorithms (e.g. Monte-Carlo) can use a different seed in each thread, generally resulting in the same statistical behaviour but different results (due to a different random sequence). Not to mention that other bugs can appear in a parallel implementation (typically undefined behaviours).
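
A minimal, self-contained illustration of the floating-point ordering effect (the chunked sum below only simulates how per-worker partial sums are combined in a parallel reduction; no actual threads are needed to see the difference):

```python
import random

# Floating-point addition is not associative: grouping changes the result.
print((1.0 + 1e-16) - 1e-16)  # 0.9999999999999999
print(1.0 + (1e-16 - 1e-16))  # 1.0

# Consequence: summing the same numbers in a different order, as a
# parallel reduction does, can give a slightly different total.
random.seed(0)
xs = [random.uniform(-1e10, 1e10) for _ in range(100_000)]

sequential = sum(xs)

# Simulate 4 workers: each sums one chunk, then the partial sums are combined.
parallel_style = sum(sum(xs[i::4]) for i in range(4))

print(sequential == parallel_style)      # almost certainly False
print(abs(sequential - parallel_style))  # small, but nonzero
```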

Jérôme Richard
  • How can I verify that it is the order of operations that is the cause of the differences? The difference in final result is not massive, but certainly more than a rounding error. How could I test to ensure that the reason is what you mentioned? – Vladimir Belik Jun 08 '22 at 23:36
  • Generally you can try to change the scheduling of the parallel operations. This is very dependent on the parallel runtime/method actually used. Note that the change is often not just due to the rounding of a few operations: the ordering issue is often combined with a numerical instability causing an accumulation of problematic rounding. A naive sum loop is very frequently used, but it is not numerically stable; the accumulator becomes big compared to the accumulated numbers. The standard deviation of the numbers has a huge impact on the result in this case. – Jérôme Richard Jun 09 '22 at 11:01
  • In such a case, the right solution is to use a numerically stable algorithm so that threading does not impact the results. For the naive sum, one solution is to implement a Kahan summation (or even better algorithms if the values are quite extreme); a short sketch follows this comment thread. This is dependent on the actual algorithm: for example, the naive one-pass standard deviation algorithm is known to be numerically unstable due to a catastrophic cancellation issue, so a parallel implementation gives different results, and a Kahan summation does not solve this problem. The two-pass algorithm does, since it is much more stable. – Jérôme Richard Jun 09 '22 at 11:10
  • As for undefined behaviours, you can track some of them using Clang sanitizers. Race conditions can also be tracked with a Clang sanitizer, although I am not sure all of them can be found (nor whether it can produce false positives). – Jérôme Richard Jun 09 '22 at 11:12
  • I'm gonna be honest, a lot of this is over my head. Nevertheless, I greatly appreciate your answer and your time. I am implementing a reinforcement learning algorithm (in parallel), so there are lots of potential complexities. I think I'm going to try implementing code that does everything exactly the same but without the actual reinforcement learning model. If there are still differences in the results, I will assume it's the floating point number issues, because only regular mathematical operations will be used that I can easily track (nothing fancy like RL). – Vladimir Belik Jun 09 '22 at 15:11
  • If you don't mind, could you confirm the logic I expressed in my edit? When I run the parallelized version with the same inputs (so 3 parallel instances, all with the exact same input), the outputs are all exactly equal to one another (even though they're different from the sequential version). Can I conclude with strong certainty from that information that the cause of the issue is *not* a race condition? – Vladimir Belik Jun 09 '22 at 15:16
  • *Absence of Evidence Is Not Evidence of Absence*. So, I cannot strictly confirm that there is no race condition (non-deterministic race conditions are quite frequent, in fact). But the results indicate that a race condition is unlikely to be the issue, and I guess it is indeed due to floating-point (FP) numbers. To check this, it might be a good idea to compute the result with a much better precision, like >=256-bit FP numbers, and see how far it is from the current one (see the decimal package). – Jérôme Richard Jun 09 '22 at 17:50
  • Thank you for your insight. I would like to use a more precise FP format, but unfortunately the reinforcement learning package I'm using casts itself to float32 precision by default, and there are so many parts that I'm not even sure how/which ones to force into higher precision. I think I'm going to try replacing the RL model with a simple linear regression model to see whether the difference in results persists. – Vladimir Belik Jun 09 '22 at 17:56
  • I also noticed I'm using np.array.astype('float') a few times throughout my code. Do you think this might be a source of imprecision? – Vladimir Belik Jun 09 '22 at 18:10
  • Not much. `float` means `float64` internally in NumPy (i.e. 64-bit floats, which are already quite precise for most operations, including those of precise physics simulations). – Jérôme Richard Jun 09 '22 at 18:23
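
For reference, a minimal sketch of the Kahan (compensated) summation mentioned in the comments above; in Python specifically, `math.fsum` already returns a correctly rounded sum:

```python
import math

def kahan_sum(values):
    """Compensated (Kahan) summation: keeps a running estimate of the
    rounding error lost at each addition and feeds it back in."""
    total = 0.0
    compensation = 0.0  # low-order bits lost so far
    for x in values:
        y = x - compensation            # re-inject previously lost bits
        t = total + y                   # low-order bits of y may be lost here
        compensation = (t - total) - y  # recover what was just lost
        total = t
    return total

# One big value plus many tiny ones: a naive sum rounds away every tiny
# addend, while the compensation preserves their combined contribution.
xs = [1.0] + [1e-16] * 1_000_000
print(sum(xs))        # 1.0 (all tiny contributions lost)
print(kahan_sum(xs))  # ~1.0000000001
print(math.fsum(xs))  # 1.0000000001 (correctly rounded reference)
```

For the higher-precision cross-check suggested above, the `decimal` module lets you raise the working precision via `decimal.getcontext().prec` (a number of significant decimal digits rather than bits).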