
I'm trying to debug my parallelism library for the D programming language. A bug report was recently filed indicating that the low-order bits of the results of some floating-point operations performed using tasks are non-deterministic across runs. (If you read the report, note that parallel reduce works under the hood by creating tasks in a deterministic way.)

This doesn't appear to be a rounding mode issue, because I tried setting the rounding mode manually. I'm also pretty sure this is not a concurrency bug: the library is well-tested (including passing a Jinx stress test), the issue is always confined to the low-order bits, and it happens even on single-core machines, where low-level memory model issues are less of a problem. What are some other reasons why floating-point results might differ depending on which thread the operations are scheduled on?

Edit: I'm doing some printf debugging here and it seems like the results for the individual tasks are sometimes different across runs.

Edit #2: The following code reproduces the issue in a much simpler way. It sums the terms of an array in the main thread, then starts a new thread to run the exact same function on the same array. The problem is definitely not a bug in my library, because this code doesn't even use it.

import std.algorithm, core.thread, std.stdio, core.stdc.fenv;

real sumRange(const(real)[] range) {
    writeln("Rounding mode:  ", fegetround);  // 0 from both threads.
    return reduce!"a + b"(range);
}

void main() {
    immutable n = 1_000_000;
    immutable delta = 1.0 / n;

    // Fill terms with deterministic values (the sum works out to roughly
    // pi / 4); both threads will sum exactly the same data.
    auto terms = new real[n];
    foreach(i, ref term; terms) {
        immutable x = ( i - 0.5 ) * delta;
        term = delta / ( 1.0 + x * x ) * 1;
    }

    immutable res1 = sumRange(terms);
    writefln("%.19f", res1);

    real res2;
    auto t = new Thread( { res2 = sumRange(terms); } );
    t.start();
    t.join();
    writefln("%.19f", res2);
}

Output:

Rounding mode: 0
0.7853986633972191094
Rounding mode: 0
0.7853986633972437348

Another Edit

Here's the output when I print in hex instead:

Rounding mode: 0
0x1.921fc60b39f1331cp-1
Rounding mode: 0
0x1.921fc60b39ff1p-1

Also, this only seems to happen on Windows. When I run this code on a Linux VM, I get the same answer for both threads.

ANSWER: It turns out that the root cause is that the floating-point state is initialized differently on the main thread than on other threads on Windows in D. That is consistent with the hex output above: the main thread's result carries a full 64-bit mantissa, while the new thread's result has only 53 significant bits, i.e. the two threads appear to be running at different x87 precision settings. See the bug report I just filed.
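
Here's a minimal sketch of one way to see the state difference directly, assuming DMD's x86 inline assembler (D_InlineAsm_X86); x87ControlWord is just a throwaway helper written for this post, not part of my library. On the affected Windows setup I'd expect the precision-control field (bits 8-9 of the control word) to differ between the two threads, matching the 64-bit vs. 53-bit mantissas in the hex output above.

import core.thread, std.stdio;

// Throwaway helper: read the x87 FPU control word via DMD x86 inline asm.
// Bits 8-9 are the precision-control field: 00 = 24-bit, 10 = 53-bit,
// 11 = 64-bit mantissa.
ushort x87ControlWord() {
    ushort cw;
    asm { fstcw cw; }
    return cw;
}

void main() {
    writefln("main thread control word: 0x%04x", x87ControlWord());

    auto t = new Thread( { writefln("new  thread control word: 0x%04x", x87ControlWord()); } );
    t.start();
    t.join();
}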

dsimcha
  • I'm sure you googled, but this link describes a few reasons why different CPU architectures could yield different floating point results (even though they both implement IEEE 754): http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html – Matt Bridges Apr 16 '11 at 00:20
  • @Matt: Sorry for the misunderstanding. I'm talking about different threads **on the same hardware** and even the same physical CPU. I've just verified this happens on single-core machines if you still use multiple threads. – dsimcha Apr 16 '11 at 00:30
  • The bug report you linked to mentions associativity -- might that be the cause of your problem? (i.e. you're not adding the same numbers in the same order each time.) – user541686 Apr 16 '11 at 00:36
  • @Mehrdad: Unless there's some detail of my code that's escaped my memory the summation order stuff is deterministic. One would expect slightly different results if the thread count changed or something, but not for identical thread counts/work unit sizes/etc. – dsimcha Apr 16 '11 at 00:39
  • @dsimcha: Hm... would you mind posting the exact code here? – user541686 Apr 16 '11 at 00:42
  • @Mehrdad: It requires that my module be installed, etc. I'm not sure that's the greatest idea. If you mean the code to the relevant parts of my module, it would be way too long. – dsimcha Apr 16 '11 at 00:45
  • @dsimcha: Well we need something to be able to reproduce your bug using some code... you can always paste at [PasteBin](http://pastebin.com) if the code is too long. – user541686 Apr 16 '11 at 00:48
  • Are you sure the rounding mode is what you think it is? Have you tried logging the rounding mode at various points to make absolutely sure something else isn't changing it without you knowing? – DK. Apr 16 '11 at 03:36
  • Have you tried printing as Hex? Some kind of formatting bug would explain the symptoms. – BCS Apr 16 '11 at 15:56
  • I get the same results (0.7853986633972191094) running your example code. (iC2D, osx) Hm, could it have something to do with the FPU state not being switched between threads or something? (I've no idea.) – 0scar Apr 16 '11 at 16:14

1 Answer


Here's a paper that explains the many reasons the same C code can lead to slightly different results. In your case, the most likely reason is CPU-internal instruction reordering.

It's simply wrong to expect floating-point calculations to be deterministic down to the low-order bits. That's not a guarantee floating-point arithmetic was designed to provide.
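
For a concrete illustration of how evaluation order alone can change a result (a contrived D sketch, not taken from your code): floating-point addition is not associative, so regrouping the very same operands gives different answers.

import std.stdio;

void main() {
    // Neither double nor 80-bit real can represent 1e20 + 1 exactly, so the
    // 1.0 is absorbed by 1e20 in one grouping but survives in the other.
    double a = 1.0, b = 1e20, c = -1e20;
    writeln((a + b) + c);  // prints 0
    writeln(a + (b + c));  // prints 1
}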

Michael Borgwardt
  • I've browsed through the paper and I understand the basics of what it says. However, in this case we're talking about the same binary code (not just the same source code) running on the same hardware. The only difference is what thread it's running on. – dsimcha Apr 16 '11 at 16:03
  • After a quick scan of that, I didn't see where it talked about CPU-level instruction reordering. Also, as I understand it, if any instruction reordering a CPU did resulted in even a single bit difference, then it should be considered a CPU bug. -- OTOH that same kind of thing at the compiler level (minor variations in code resulting in different rounding locations, etc.) was mentioned and is a real issue. (BTW, in D/x86 `real` is an 80-bit float, so that math should never truncate.) – BCS Apr 16 '11 at 16:12
  • @dsimcha, @BCS: you're right, I misremembered that (it only mentions instruction reordering done by the compiler). But since multithreading is involved, could the difference be due to 80 bit register values being truncated to 64 bits when a thread moves between cores? – Michael Borgwardt Apr 16 '11 at 16:39
  • Any OS that trims FP registers during a context switch would have long ago been reported as broken. – BCS Apr 17 '11 at 01:28