
In making a somewhat large refactoring change that did not modify any kind of arithmetic, I managed to somehow change the output of my program (an agent-based simulation system). Various numbers in the output are now off by minuscule amounts; examination shows that these numbers are off by one bit in their least significant bit.

For example, 24.198110084326416 would become 24.19811008432642. The floating point representation of each number is:

24.198110084326416 = 0 10000000011 1000001100101011011101010111101011010011000010010100
24.19811008432642  = 0 10000000011 1000001100101011011101010111101011010011000010010101

In which we notice that the least significant bit is different.
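This one-ulp difference can be confirmed with the standard `Double` and `Math` methods (a small self-contained check using the two values above):

```java
public class UlpCheck {
    public static void main(String[] args) {
        double before = 24.198110084326416;
        double after  = 24.19811008432642;

        // Raw bit patterns, matching the layout shown above
        System.out.println(Long.toBinaryString(Double.doubleToLongBits(before)));
        System.out.println(Long.toBinaryString(Double.doubleToLongBits(after)));

        // Math.nextUp returns the adjacent representable double,
        // i.e. the value exactly one ulp above its argument.
        System.out.println(Math.nextUp(before) == after); // true
    }
}
```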

My question is how I could have introduced this change when I had not modified any type of arithmetic. The change involved simplifying an object by removing inheritance (its superclass was bloated with methods that were not applicable to this class).

I note that the output (displaying the values of certain variables at each tick of the simulation) will sometimes be off, then for another tick the numbers are as expected, only to be off again for the following tick (e.g., on one agent, its values exhibit this problem on ticks 57-83, are as expected for ticks 84 and 85, and are off again for tick 86).

I'm aware that we shouldn't compare floating-point numbers for exact equality. These errors were noticed when an integration test that merely compared the output file to an expected output file failed. I could (and perhaps should) fix the test to parse the files and compare the parsed doubles with some epsilon, but I'm still curious as to why this issue may have been introduced.
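For what it's worth, the epsilon comparison I have in mind would look roughly like this (a sketch only; `nearlyEqual` and the 1e-9 tolerance are illustrative choices, not something from the actual test suite):

```java
public class FpCompare {
    // Relative comparison with an absolute floor near zero.
    static boolean nearlyEqual(double a, double b, double relEps) {
        if (Double.isNaN(a) || Double.isNaN(b)) return false;
        if (a == b) return true; // exact matches, including equal infinities
        if (Double.isInfinite(a) || Double.isInfinite(b)) return false;
        double diff = Math.abs(a - b);
        double scale = Math.max(Math.abs(a), Math.abs(b));
        return diff <= Math.max(relEps * scale, Double.MIN_NORMAL);
    }

    public static void main(String[] args) {
        System.out.println(nearlyEqual(24.198110084326416, 24.19811008432642, 1e-9)); // true
        System.out.println(nearlyEqual(24.198110084326416, 24.2, 1e-9));              // false
    }
}
```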

EDIT:

Minimal diff of change that introduced the problem:

diff --git a/src/main/java/modelClasses/GridSquare.java b/src/main/java/modelClasses/GridSquare.java
index 4c10760..80276bd 100644
--- a/src/main/java/modelClasses/GridSquare.java
+++ b/src/main/java/modelClasses/GridSquare.java
@@ -63,7 +63,7 @@ public class GridSquare extends VariableLevel
    public void addHousehold(Household hh)
    {
        assert household == null;
-       subAgents.add(hh);
+       neighborhood.getHouseholdList().add(hh);
        household = hh;
    }

@@ -73,7 +73,7 @@ public class GridSquare extends VariableLevel
    public void removeHousehold()
    {
        assert household != null;
-       subAgents.remove(household);
+       neighborhood.getHouseholdList().remove(household);
        household = null;
    }

diff --git a/src/main/java/modelClasses/Neighborhood.java b/src/main/java/modelClasses/Neighborhood.java
index 834a321..8470035 100644
--- a/src/main/java/modelClasses/Neighborhood.java
+++ b/src/main/java/modelClasses/Neighborhood.java
@@ -166,9 +166,14 @@ public class Neighborhood extends VariableLevel
    World world;

    /**
+    * List of all grid squares within the neighborhood.
+    */
+   ArrayList<VariableLevel> gridSquareList = new ArrayList<>();
+
+   /**
     * A list of empty grid squares within the neighborhood
     */
-   ArrayList<GridSquare> emptyGridSquareList;
+   ArrayList<GridSquare> emptyGridSquareList = new ArrayList<>();

    /**
     * The neighborhood's grid square bounds
@@ -836,7 +841,7 @@ public class Neighborhood extends VariableLevel
     */
    public GridSquare getGridSquare(int i)
    {
-       return (GridSquare) (subAgents.get(i));
+       return (GridSquare) gridSquareList.get(i);
    }

    /**
@@ -865,7 +870,7 @@ public class Neighborhood extends VariableLevel
    @Override
    public ArrayList<VariableLevel> getGridSquareList()
    {
-       return subAgents;
+       return gridSquareList;
    }

    /**
@@ -874,12 +879,7 @@ public class Neighborhood extends VariableLevel
    @Override
    public ArrayList<VariableLevel> getHouseholdList()
    {
-       ArrayList<VariableLevel> list = new ArrayList<VariableLevel>();
-       for (int i = 0; i < subAgents.size(); i++)
-       {
-           list.addAll(subAgents.get(i).getHouseholdList());
-       }
-       return list;
+       return subAgents;
    }

Unfortunately, I'm unable to create a small, compilable example: I cannot replicate this behavior outside of the program, nor can I cut this very large and entangled program down to size.

As for what kind of floating-point operations are being done, there's nothing particularly exciting: a ton of addition, multiplication, natural logarithms, and powers (almost always with base e). The latter two are done with the standard library. Random numbers are used throughout the program and are generated with the Random class included with the framework being used (Repast).

Most numbers are in the range of 1e-3 to 1e5; there are almost no very large or very small numbers. Infinity and NaN are used in many places.

Being an agent-based simulation system, many formulas are applied repetitively to simulate emergence. The order of evaluation is very important, as many variables depend on others being evaluated first (e.g., to calculate the BMI, we need the diet and cardio status to be calculated first). The previous values of variables are also very important in many calculations, so this issue could be introduced somewhere early in the program and carried throughout the rest of it.

Kat
  • Are you using `strictfp`? – user253751 Jul 30 '14 at 21:31
  • Did you somehow change the order of your mathematical operations? – rgettman Jul 30 '14 at 21:31
  • There are a number of ways that could happen, but my money's on the optimization step of the compiler shuffled arithmetic operations around differently because it got a different AST as input, and that resulted in a change to output. Although... then again... any good optimizer first guarantees it doesn't change the semantic meaning of operations, so I'm not sure. Need to mull this over a bit more. – Parthian Shot Jul 30 '14 at 21:32
  • @ParthianShot - In theory the compiler should not be reordering FP ops in a way that could cause a difference. Java really hammered on this issue. – Hot Licks Jul 30 '14 at 21:42
  • Just as a sanity check... This is a consistent thing, right? Can't just be caused by a non-ECC memory? And your machine isn't bathed in gamma radiation? – Parthian Shot Jul 30 '14 at 21:44
  • And you're using the same compiler each time? – Parthian Shot Jul 30 '14 at 21:48
  • If you still have the old source, go back and try it again, and make sure there's no difference between compiler, compiler parms, and JVM between the two. – Hot Licks Jul 30 '14 at 21:56
  • @HotLicks that's *if* you're using `strictfp`. – user253751 Jul 30 '14 at 22:02
  • @immibis - I'm (more than) a little fuzzy on that. The permitted transformations were limited even in non-strict mode, but I don't recall the details. – Hot Licks Jul 30 '14 at 22:34
  • See [4.2.3 Floating-Point Types, Formats, and Values](http://docs.oracle.com/javase/specs/jls/se5.0/html/typesValues.html#9208) for the differences between strict and non-strict. – Patricia Shanahan Jul 30 '14 at 22:55
  • @HotLicks: http://docs.oracle.com/javase/specs/jls/se7/html/jls-15.html#jls-15.4 suggests that omitting `strictfp` only permits intermediate results to use a larger exponent range. This flies in the face of my personal experience with Hotspot, however. – tmyklebu Jul 30 '14 at 23:03
  • @immibis, no `strictfp`. Although the machine (both hardware and software) has not changed in the course of this refactoring, so I don't see why `strictfp`, as I understand it, would make a difference here. – Kat Jul 31 '14 at 14:24
  • @rgettman shouldn't be. There's a handful of places where *integers* may have different orders, but my changes don't even touch any doubles! – Kat Jul 31 '14 at 14:25
  • @ParthianShot, yes, it's consistent (I've switched between the feature and master branches to be sure). Exact same hardware and software (including compiler). – Kat Jul 31 '14 at 14:26
  • I'm going to try and re-implement this change in more gradual steps in hopes of determining exactly what changes may have caused this. – Kat Jul 31 '14 at 14:27
  • Complete shot in the dark, here, but does that number depend at all on durations within the program, or a PRNG that might be trying some simple entropy harvesting, or... anything other than direct inputs? Maybe stack size or something? – Parthian Shot Jul 31 '14 at 15:12
  • @ParthianShot Nothing is time dependent. There is a RNG being used, but it has a constant seed (for our test suite, anyway) and no calls to the RNG have been added or removed in this change (although that would probably cause much, much larger changes, anyway). – Kat Jul 31 '14 at 15:17
  • For those interested, I have isolated the change that causes this issue. It's even more bizarre than I expected. Here's the diff: http://pastebin.com/RgLVDxG5. Context: there's various "levels" of entities in this simulation: world, neighborhood, grid square, household, and agent. Each level stores a list of "subagents" (which is the lower level entities that are contained within that entity). Grid square stands out as being very useless, so is being removed from this list of entities. Thus, the neighborhood's subagents are households. Grid squares are not "levels" anymore. – Kat Jul 31 '14 at 15:30
  • So it really just changes a few lists. I don't see how it would impact floating point results, but somehow it does (in the manner described in the post). I've also confirmed that this tiny diff is the only thing affecting the results. As an aside, the diff isn't complete in the sense that changes are correct. The program can get grid squares, but can't add them. Its merely the earliest point at which this problem was reintroduced. – Kat Jul 31 '14 at 15:31
  • @Mike The paste has been removed... Also, congratulations on being the first person I've heard of to actually use pastebin for posting a code snippet. Generally when I heard the word "pastebin" the sentence also includes the phrase "leaked passwords". – Parthian Shot Jul 31 '14 at 15:33
  • @ParthianShot That's strange. I've rehosted the diff on a Gist: https://gist.github.com/anonymous/39a6389e10abb161257c – Kat Jul 31 '14 at 15:37
  • @Mike: I was able to access both diff links, and it does look like that change is what's causing the switch. BTW, to keep it from being closed, perhaps edit and put the diff link into the question. Also, can you give an example of what the FP code looks like? – Menachem Jul 31 '14 at 21:35
  • @Menachem, I updated the post as you mentioned, including details about what kind of FP operations are being done. – Kat Jul 31 '14 at 22:37
  • My memory is returning slightly on `strictfp`. If you *did not* specify `strictfp` then the JVM is at (some) liberty to carry intermediate values as 80-bit rather than 64-bit quantities. (With `strictfp`, after every computation any intermediate values must be "narrowed" to 64 bits.) Whether or not a value is "narrowed" can make a 1-lsb difference in the result (probably about half the time). Likely when the OP removed the "useless" "grid squares" (or whatever they were) it removed some adds of 0.0 or some such and changed which values get coerced to 64 bits when. – Hot Licks Aug 01 '14 at 00:10
  • The diff should be posted *here.* Otherwise the question has no permanent value and is liable for deletion. – user207421 Aug 01 '14 at 01:04
  • @HotLicks, I just applied `strictfp` to all classes, enums, and interfaces of my program. The output did not change. So I don't think `strictfp` is the solution here. – Kat Aug 01 '14 at 17:29
  • Sounds good. Note that even without `strictfp` the output would only change if you recompiled the code, either with (seemingly innocuous) code changes or with a new version of *javac*. Or possibly with a new JDK version. – Hot Licks Aug 01 '14 at 17:34

3 Answers


Here are a few ways in which the evaluation of a floating-point expression can differ:

(1) Floating-point processors have a "current rounding mode", which could cause results to differ in the least significant bit. There are calls to get and set the current mode: round to nearest, toward zero, toward -∞, or toward +∞.

(2) It sounds like strictfp is related to FLT_EVAL_METHOD in C, which specifies the precision to be used in intermediate computations. Sometimes a new version of a compiler will use a different method than the old one (I was bitten by that one). The values {0,1,2} correspond to {single, double, extended} precision respectively, unless overridden by higher-precision operands.
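In Java terms, opting into strict evaluation looks like the sketch below (the class and method names are invented for illustration; note that as of Java 17, JEP 306 made strict IEEE 754 semantics the default, so the keyword is obsolete there):

```java
// With strictfp, every intermediate result inside this class must be
// rounded to a standard 64-bit (or 32-bit) IEEE 754 value; without it,
// pre-Java-17 JVMs were permitted extra exponent range in intermediates.
public strictfp class StrictDemo {
    static double mulAdd(double a, double b, double c) {
        return a * b + c; // each step rounded to a 64-bit double
    }

    public static void main(String[] args) {
        System.out.println(mulAdd(2.0, 3.0, 1.0)); // 7.0
    }
}
```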

(3) In the same way that a different compiler can have a different default float evaluation method, different machines can use a different float evaluation method.

(4) Single precision IEEE floating-point arithmetic is well-defined, repeatable, and machine-independent. So is double-precision. I have written (with great care) cross-platform floating-point tests which use an SHA-1 hash to check the computations for bit exactness! However, with FLT_EVAL_METHOD=2, extended precision is used for the intermediate computations, which is variously implemented using 64-bit, 80-bit or 128-bit floating point arithmetic, so it is difficult to get cross-platform and cross-compiler repeatability if extended precision is used in the intermediate computations.

(5) Floating point arithmetic is not associative, i.e.

(A + B) + C ≠ A + (B + C)

Because of this, compilers are not allowed to reorder floating-point computations.
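The non-associativity in point (5) is easy to demonstrate (the values are arbitrary, chosen only to make the rounding visible):

```java
public class AssocDemo {
    public static void main(String[] args) {
        double a = 0.1, b = 0.2, c = 0.3;
        // Same three operands, different grouping, different result:
        System.out.println((a + b) + c); // 0.6000000000000001
        System.out.println(a + (b + c)); // 0.6
        System.out.println((a + b) + c == a + (b + c)); // false
    }
}
```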

(6) Order of operations matters. One algorithm to compute the sum of a large set of numbers with the greatest possible precision is to sum them in increasing order of magnitude. On the other hand, if two numbers differ enough in magnitude,

B < (A * epsilon)

then summing them is a no-op:

A + B = A
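The absorption effect in point (6), sketched with arbitrary values chosen so the smaller operand is below half an ulp of the larger:

```java
public class AbsorbDemo {
    public static void main(String[] args) {
        double big  = 1.0e16; // ulp(1e16) == 2.0, so anything below 1.0 vanishes
        double tiny = 0.5;
        System.out.println(big + tiny == big); // true: tiny is absorbed
        System.out.println(Math.ulp(big));     // 2.0
    }
}
```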
Reality Pixels

As strictfp has been eliminated, I'll offer an idea.

Some versions of Repast had / have bugs with certain random numbers being generated incorrectly*.

Even with the random seed set to the same value, because your ArrayList is created and used at a different point in your code, it is possible that you are acting on the agents in it in a different order. This is particularly true if you have any scheduled method with random priority, or if you use getAgentList() or similar to populate your subAgents list. In effect, you can end up consuming random numbers (or ordering agents) in a way that is outside the control of the RNG whose seed you set.

If there is a slight difference in order of execution, this could explain correspondence at one step only to see this small difference at other steps.
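To sketch the mechanism with java.util.Random standing in for Repast's RNG (the seed and the two-agent structure are invented for illustration): with a fixed seed the stream of draws is fixed, so changing which agent draws first changes which number each agent sees.

```java
import java.util.Random;

public class SeedOrderDemo {
    public static void main(String[] args) {
        // Run 1: agent A draws before agent B
        Random run1 = new Random(12345L);
        double aFirst  = run1.nextDouble();
        double bSecond = run1.nextDouble();

        // Run 2: same seed, but agent B draws first this time
        Random run2 = new Random(12345L);
        double bFirst  = run2.nextDouble();
        double aSecond = run2.nextDouble();

        // The stream itself is identical; only the ownership swapped,
        // so agent A now sees a different number than before.
        System.out.println(aFirst == bFirst);  // true: same position in the stream
        System.out.println(aFirst == aSecond); // almost certainly false
    }
}
```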

I have had this happen and had similar headaches to yours when debugging. Happy to go into more detail if you can provide more context.


*It will help a lot to know which version you are using (I know I shouldn't ask for clarification in an answer, but haven't got the rep to comment). From the API you link, I think you are using the old Repast 3 - I use Simphony, but the answer may still apply.

J Richard Snape
  • Using Repast 3.1. I'd like to be using Repast S, but there's no clear upgrade path, so we haven't bothered upgrading, yet. – Kat Aug 05 '14 at 18:51
  • Agents don't seem to be acted on in a different order (indices are used heavily to refer to particular agents, variables, etc, so their order can't trivially change). Regarding scheduling, the scheduling is very simple: the step method scheduled at the beginning, an experimental stage scheduled at a certain tick (currently disabled), and a recorder scheduled at the end. – Kat Aug 05 '14 at 18:57
  • Regarding execution order, the order is definitely different in some places, but no `Random` calls are in the changed methods. I'd also expect much larger changes if the `Random` calls were switched around. – Kat Aug 05 '14 at 19:00
  • re upgrading 3.1 to S - I believe it's hard and requires refactoring. I've found even between Sv1.2 and Sv2.1 a "challenge". The fact you're using 3.1 means my detailed experience probably won't apply. – J Richard Snape Aug 06 '14 at 10:05
  • Without knowing your code, I too would generally expect reordered calls to Random to make more difference. If you want to chase this more - happy to discuss. When you said it was consistent, is always same bit shift (i.e. LSB 0 -> 1)? – J Richard Snape Aug 06 '14 at 10:24
  • Actually, further inspection shows that numbers are not always off by a bit in the LSB. Some numbers are one bit off in the 5th LSB. Every number that I checked was off by 1 bit (although that may be chance -- I wasn't able to check every single number). Exactly which bit is different varies, but is one of the last several. – Kat Aug 06 '14 at 16:02
  • That makes me think even more that we're looking at a change in order of arithmetic operations. As people have said, the compiler is supposed not to do that. So, I suspect a) the variable is calculated based on a number from another agent and the order of evaluation has changed so, say, a divisor or multiplier has changed slightly or b) your refactor has changed the flow through arithmetic calculations subtly. I don't think I can go further without more context (code). I guess it's not on github or anything? – J Richard Snape Aug 07 '14 at 09:18
  • Very possible. Unfortunately, no, the code is not available online and the licensing status is rather dubious, so I can't post it. I haven't been able to find any change to arithmetic, but eventually gave up and created a parser for the output that compares FP with an epsilon. – Kat Aug 07 '14 at 15:16

Without exact source code to reproduce the problem, it is obviously impossible to pinpoint it. But your diff shows that you changed the way lists get processed, and you also mention that a lot of simple math, like addition, happens inside your application. Therefore my guess is that the changes to the lists altered the order in which things get processed, which may be enough to change rounding errors.

And yes, nothing should ever rely on the least significant bits of floating-point variables, so the tests should use epsilons.

Jens Schauder
  • 1
    -1 for generally posting a comment as an answer, and in particular for the sweeping “nothing should ever rely on the least significant bits of floating point variables” statement. For all we know, the OP has found a compiler bug. What you are saying is “the test that reveals that your compiler miscompiled or miscompiles your code is a bad test. You should never write the kind of test that reveals that the compiler has bugs on your code”. Duh. – Pascal Cuoq Aug 01 '14 at 06:16