1

Assume you're given two sets of floating-point variables implemented according to IEEE 754, meant to be treated as the exact values they represent under the standard's formulae. All legal values are possible. The number of variables in a set may be any natural number.

What would be a good way to compare the sums, exact in the mathematical sense, of the values represented by those variables? Due to the domain's nature, the problem can easily be restated as comparing a single sum to zero. You can disregard the possibility of NaNs or infinities, as it is irrelevant to the core problem. (Those values can be checked for easily and independently, and acted upon in a manner suiting the particular application of this problem.)

A naive approach would be to simply sum and compare, or to sum the values of one set and subtract the values of the other.

    #include <vector>

    // Naive approach: accumulate each set separately and compare the sums.
    bool compare(const std::vector<float>& lhs, const std::vector<float>& rhs)
    {
        float lSum = 0.0f;
        for (auto value : lhs)
        {
            lSum += value;
        }
        float rSum = 0.0f;
        for (auto value : rhs)
        {
            rSum += value;
        }

        return lSum < rSum;
    }

Quite obviously there are problems with the naive approach, as mentioned in various other questions regarding floating-point arithmetic. Most of the problems are related to two difficulties:

  • the result of adding floating-point values differs depending on the order of operations
  • certain orders of addition over certain sets of values may cause intermediate overflow (an intermediate result goes beyond the range supported by the available data type)

    float small = strtof("0x1.0p-126", NULL);
    float big = strtof("0x1.8p126", NULL);
    
    std::cout << std::hexfloat << small + big - big << std::endl;
    std::cout << std::hexfloat << (big-2*small) + (big-small) + big - (big+small) - (big+2*small) << std::endl;
    

This code will result in 0 and inf, respectively; this illustrates how ordering affects the result, and hopefully also that the problem of ordering is non-trivial.

    float prev;
    float curr = 0.0f;
    
    do
    {
        prev = curr;
        curr += strtof("0x1.0p-126", NULL);
    } while (prev != curr);
    
    std::cout << std::hexfloat << curr << std::endl;
    

This code, given sufficient time to actually finish computing, would result in 0x1.000000p-102, not, as might naively be expected, 0x1.fffffep127. (Changing the initialization of curr to `strtof("0x1.fff000p-103", NULL)` is advised in order to actually observe this.) It illustrates how the proportion between an intermediate result of the addition and a particular addend affects the result.

A lot has been said about obtaining the best precision, e.g. in this question.

The problem at hand differs in that we do not want to maximize precision; rather, we have a well-defined function (the sign of the exact sum) that needs to be implemented exactly.

While to some the idea that this may be a useful exercise seems controversial at best, consider the following scenario: comparison between those value sets could be a cornerstone of other operations performed on entire datasets independently in various environments. Synchronized, flawless operation of some systems may depend on this comparison being well defined and deterministically implemented, regardless of addend order and of whether the particular architecture implements IEEE 754 or not.

This, or just curiosity.

In the discussion, the Kahan summation algorithm has been mentioned as relevant. However, that algorithm is merely a reasonable attempt at minimizing error. It neither guarantees the correct sign of the result, nor is it independent of the order of operations (which would at least guarantee a consistent, if wrong, result across permutations of the sets).
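For instance, consider the following sketch (illustrative values chosen by me, not from any particular dataset): the exact sum of the three values is 1.0, yet Kahan summation in float yields 0, so even the sign is lost.

    #include <iostream>
    #include <vector>

    int main()
    {
        std::vector<float> values = {1e20f, 1.0f, -1e20f}; // exact sum is 1.0
        float sum = 0.0f;
        float c = 0.0f; // running compensation
        for (float x : values)
        {
            float y = x - c;
            float t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        std::cout << sum << std::endl; // prints 0; the sign of the exact sum is lost
    }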

One of the most obvious solutions would be to employ or implement fixed-point arithmetic, using a sufficient number of bits to represent every possible operand value exactly and to keep the exact intermediate result.

Perhaps, however, this can be done using only floating-point arithmetic in a manner that guarantees the correct sign of the result. If so, the problem of overflow (as illustrated in one of the examples above) needs to be addressed in the solution, as this question has that particular technical aspect.

(What follows is original question.)

I have two sets of multiple floating point (float or double) values. I want to provide a perfect answer to the question of which set has the larger sum. Because of artifacts of floating-point arithmetic, in some corner cases the result of the naive approach may be wrong, depending on the order of operations, not to mention that a simple sum can overflow. I can't provide any effort on my side, because all I have are vague ideas, all of them complicated and unconvincing.

Community
  • 1
  • 1
szpanczyk
  • 486
  • 4
  • 15
  • 1
    All comments removed as they were getting noisy and veering into antagonistic territory. If there is anything that still needs clarifying repost your comment but **be civil** – ChrisF Jan 18 '17 at 21:10
  • 1
    Hopefully this will be reopened soon, but what you want is called a "superaccumulator". Some relevant code and references are here: https://github.com/aseldawy/sumn – Simon Byrne Jan 18 '17 at 21:11
  • 4
    If there is additional information that can be added to this question to make it clear that it is not a duplicate, please edit it into the question. – Mark Ransom Jan 18 '17 at 21:11
  • This question does not require any additional information. I could provide additional information as soon as it is agreed that it is not required but just helpful. – szpanczyk Jan 18 '17 at 21:12
  • 1
    @EOF The accepted solution for this is to use Kahan's Summation. That's spelled out in one of the answers to: http://stackoverflow.com/q/6699066/2642059 – Jonathan Mee Jan 18 '17 at 21:13
  • 4
    @szpanczyk any clarifying comments you made earlier are now lost, which is why I ask you to edit the question. And a question can be considered duplicate if the answers to that question answer yours as well. – Mark Ransom Jan 18 '17 at 21:16
  • And I can clarify the question, sure, but all my clarifying was actually informing that what should have been assumed as default was indeed what I meant. For example, if I did not provide information on a limited range of values in the question, there is no reason to assume this information is missing. It's obvious that there is no limited range and all values are possible. – szpanczyk Jan 18 '17 at 21:17
  • @JonathanMee Kahan summation, while more accurate, still isn't exact. e.g. try adding `1e40`, `1e20` and `1.0` – Simon Byrne Jan 18 '17 at 21:18
  • As for infinities and NaNs, those are specific values which as such do not pose a problem for any solution, because they can be checked for in constant time per variable. – szpanczyk Jan 18 '17 at 21:18
  • 2
    @SimonByrne I'm just saying that it's the accepted solution for this problem. If we needed an exact solution we wouldn't be using floating point numbers in the first place. – Jonathan Mee Jan 18 '17 at 21:19
  • I may have made a mistake somewhere in the question, forgotten about something or made something unclear, but nobody as of yet has pointed me to such mistake that I would accept as mistake. If I was politely asked to clarify anything that doesn't necessarily require clarifying, I would have just done it without hesitation. But I wasn't, I was told that my question is missing some information and generally is a bad question. Which it is not. – szpanczyk Jan 18 '17 at 21:20
  • @JonathanMee there are perfectly valid reasons for wanting exact sums of floating point numbers. e.g. computational geometry – Simon Byrne Jan 18 '17 at 21:22
  • I have already said both: I assume the values are exact because such are the constraints of my problem. Why? There may be various reasons, one of which is to provide reliable comparison that does not depend on ordering of values. This is a good reason for this question, but I don't think that every question must have its reason stated explicitly. If a question is good and could have been asked for various reasons, why would You need a particular one stated? I think I have had enough. It just can't be reasoned in some circumstances and my time is probably better spent elsewhere. – szpanczyk Jan 18 '17 at 21:23
  • Assume this comparison is to be performed in many, many environments. One of which is a C++ implementation of some service. Reliable is the keyword here - I want for those two sets of data to have the exact same result of comparison everywhere. So I have to establish a standard on what should be the right answer, that is perfectly precise. It doesn't matter that precision was lost upon storing data in this format. It's irrelevant. – szpanczyk Jan 18 '17 at 21:26
  • @SimonByrne You're saying that there is a need for an exact solution using floating point numbers o.O Floating point numbers are inherently inexact? Is there an alternative to Kahan's Summation that you're proposing, cause if so, you've snagged my interest. – Jonathan Mee Jan 18 '17 at 21:26
  • 2
    @JonathanMee Floating point numbers themselves are exact, it's the operations with them (arithmetic, conversion to/from decimal) that are inexact. As I mentioned in a comment above, there are algorithms called superaccumulators which can be used to compute exact sums. – Simon Byrne Jan 18 '17 at 21:30
  • Let me say that I think this question is a duplicate of: http://stackoverflow.com/q/6699066/2642059 but other than that I've read through [the asking guidelines](http://stackoverflow.com/help/how-to-ask) and the only point that I'm seeing this answer failing on is the "Help Others Reproduce the Problem" section. – Jonathan Mee Jan 18 '17 at 21:35
  • And this section is not universally relevant too. – szpanczyk Jan 18 '17 at 21:36
  • I mean, I can provide code that shows how order of addition changes result of comparison, but do I really need to? Isn't that a little bit too obvious for those that may be interested in the problem? – szpanczyk Jan 18 '17 at 21:37
  • @szpanczyk In my proud opinion an example would have been beneficial to the question. (Though I wouldn't go add it at this point.) – Jonathan Mee Jan 18 '17 at 21:37
  • Yeah it could have been helpful. Now, why would it be? In my opinion not because it is needed for the merit discussion. Just because if there was an example provided, it would help the odds. Because after asking a couple of questions recently, I really think this is a matter of chance. If You happen to be well understood, Your question will be upvoted. If You're misunderstood, there's just no reasoning with downvoters. – szpanczyk Jan 18 '17 at 21:39
  • One of my questions was downvoted for lack of my code too. – szpanczyk Jan 18 '17 at 21:39
  • http://stackoverflow.com/questions/41660299/using-objects-of-two-or-more-data-types-as-key-for-c-map-oop they told me to provide mcv example. Of what? This kind of thing is useful if there is a dissonance between understanding and observed phenomena, I thought my code would work, it doesn't. My question was not of the kind, yet two people commented I should provide minimal example of my own code. It seems some people on this website are really not helpful. – szpanczyk Jan 18 '17 at 21:41
  • @SimonByrne So my understanding of superaccumulators was just that you'd use multiple accumulators where Kahan's summation only uses one. Is that accurate? – Jonathan Mee Jan 18 '17 at 21:42
  • 2
    This post may have suffered because no explicit question was ever asked. This site has a history of dismissing "questions" that just describe a program that someone should write. – Drew Dormann Jan 18 '17 at 21:42
  • My next question is dedicated to you personally - please, tell me how you became a telepath. – szpanczyk Jan 18 '17 at 21:44
  • 1
    @JonathanMee Yes, that's more or less the idea (though I would say that Kahan summation uses 2 accumulators). – Simon Byrne Jan 18 '17 at 21:45
  • 1
    @szpanczyk don't get upset on my account. That's not my goal. – Drew Dormann Jan 18 '17 at 21:47
  • @szpanczyk I glanced at your linked question you have an answer from [dasblinkenlight](http://stackoverflow.com/users/335858/dasblinkenlight) who is one of the brightest minds on here. I share your frustration with asking [tag:c++] questions though. That tag is commonly known as the shark-tank. It just has a lot of negativity. Ultimately I resolved my thoughts in the meta here: http://stackoverflow.com/users/335858/dasblinkenlight I hope you find this encouraging. – Jonathan Mee Jan 18 '17 at 21:48
  • I don't know what your goal is. I am upset because this whole situation was very frustrating for me. This is a valid question and I am not yet convinced Kahan summation solves the problem the way I need. I feel I was treated unjustly, I am still sure there is nothing wrong with my question (maybe it could be better, but it's good already), it was asked with best intentions and after all that, all that was missing in this discussion was accusing me of laziness and wanting others to provide me with a working program. What You wrote was inconsiderate and not thoughtful at all. That's why I am upset. – szpanczyk Jan 18 '17 at 21:50
  • @SimonByrne Yeah, I hadn't considered the idea of using more accumulators, but thanks to you I started reading through a paper on accumulator distribution. It's quite an interesting topic. Thanks for the tip! – Jonathan Mee Jan 18 '17 at 21:50
  • The answer on that question was good. I really like this website. It's just annoying when someone declares there is something wrong with your question and does not have to prove it. I know the mechanism. People with a certain amount of points get to close questions. It would be nice if they did not act like despotic monarchs, but rather restrained themselves to the role of judges. – szpanczyk Jan 18 '17 at 21:52
  • @szpanczyk My apologies. I'd encourage you to spend more time with Kahan's Summation, if you feel like that doesn't totally solve your problem I'd encourage you to ask a follow up question on it. (If you place a link here I'd be happy to come assist.) – Jonathan Mee Jan 18 '17 at 21:53
  • "which set has larger sum" --> This can be converted into another problem. By negating each element of the 2nd group and simply adding the 2 groups together, all we need to do is find if the sum if + or - or 0. The [Kahan summation](https://en.wikipedia.org/wiki/Kahan_summation_algorithm) is similar, yet we do not need to know the sum, just the sign of the sum. Now how to precisely add and maintain the sign .... – chux - Reinstate Monica Jan 18 '17 at 23:08
  • @chux I have been thinking about it, and I think this still is non-trivial. Does this address the issue of intermediate result overflowing? – szpanczyk Jan 19 '17 at 00:14
  • @MarkRansom what resulted in the question being reopened? – szpanczyk Jan 19 '17 at 00:15
  • It takes a vote, and evidently enough people thought the question was worthwhile. As far as keeping it open, I'd put as much detail and supporting information into the question as you can. For example, in a comment that was deleted you had the reason why you needed the comparison to be stable. If the duplicates you've seen don't solve your problem, mention why. – Mark Ransom Jan 19 '17 at 00:27
  • I'll do whatever I decide to do. Thanks. – szpanczyk Jan 19 '17 at 00:29
  • And I have decided to create a rewrite, to use the opportunity that the question's reopening has given to show with superfluous content how this was a proper question from the beginning. Everything I mentioned, people of good will would have been able to correctly assume as obvious (e.g. in the absence of a range limit for values, it's obvious all values are to be expected, including NaNs and infs, whose presence is nevertheless irrelevant). – szpanczyk Jan 19 '17 at 03:02
  • Whatever clarification would've been asked for with respect to common courtesy and with recognition of the fact that problem was already defined well enough, would've been gladly given and probably edited into the text of question. But if somebody judges and acts rather than ask first to give someone benefit of doubt, that's a different case completely. – szpanczyk Jan 19 '17 at 03:04
  • Possible duplicate of [In which order should floats be added to get the most precise result?](http://stackoverflow.com/questions/6699066/in-which-order-should-floats-be-added-to-get-the-most-precise-result) – Jonathan Mee Jan 19 '17 at 12:16
  • @JonathanMee do you genuinely not understand the difference between minimizing error and obtaining a perfect answer in a well defined bool function? – szpanczyk Jan 19 '17 at 14:42
  • @szpanczyk As I mention in my answer: You will not obtain a perfect answer with a floating point number. Does that answer your question? I may have misunderstood what you're asking. – Jonathan Mee Jan 19 '17 at 14:50
  • Yes, You did misunderstand it. I tried to explain that since yesterday, this is one of the reasons why I'm so frustrated and fed up. I can't help You anymore, I've put more than a fair share of effort and now all I can say is - read the question once again perhaps. – szpanczyk Jan 19 '17 at 14:59
  • Compare my self posted answer. – szpanczyk Jan 19 '17 at 15:02
  • "While for some the idea that it may be useful exercise seems controversial at best, consider the following scenario: comparison between those value sets could be a cornerstone of other operations performed on entire datasets independently in various environments. Synchronized, flawless operation of some systems may depend on this comparison being well defined and deterministically implemented, irregardless of addends order and particular architecture implementing IEEE754 or not. This, or just curiosity." – szpanczyk Jan 19 '17 at 15:06
  • Perhaps we're in a setting where we take responsibility for results of this comparison, and we're not supposed to make any assumptions about acceptable approximation. We're given data in this form from the outside and we're doing whatever we're doing, but have to treat every value as equal exactly what it represents in IEEE754. As was earlier stated - values are exactly what they are, it's the operations that are approximate. I don't have a problem with operations yielding 1000.0 when the correct answer would be 0.001. All I care about is the sign. It has to be perfectly correct. – szpanczyk Jan 19 '17 at 15:09

4 Answers

6

One possible approach is to compute the sum using a superaccumulator: an algorithm for computing exact sums of floating point numbers. Although these ideas have been around for a while, the term itself is relatively new.

In some sense, you can think of it as an extension of Kahan summation, where the running sum is stored as an array of values rather than just a pair. The main challenge then becomes figuring out how to allocate the precision amongst the various values.
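For context, here is a minimal sketch (my illustration, not taken from the papers cited below) of the error-free transformation, Knuth's TwoSum, that such algorithms typically build on; s + e equals a + b exactly (barring overflow), so no information is lost:

#include <utility>

// Knuth's TwoSum: s = fl(a + b), and s + e == a + b exactly.
std::pair<double, double> two_sum(double a, double b) {
    double s = a + b;
    double bv = s - a;                    // the portion of b absorbed into s
    double e = (a - (s - bv)) + (b - bv); // exact rounding error of the addition
    return {s, e};
}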

Some relevant papers and code:

  • Y. K. Zhu and W. B. Hayes. "Algorithm 908: Online Exact Summation of Floating-Point Streams". ACM Transactions on Mathematical Software (ACM TOMS), 37(3):37:1-37:13, September 2010. doi: 10.1145/1824801.1824815

    • Unfortunately the paper and code are behind a paywall, but this appears to be the C++ code.
  • R. M. Neal, "Fast Exact Summation using Small and Large Superaccumulators". 2015. arXiv: 1505.05571

  • M. T. Goodrich, A. Eldawy "Parallel Algorithms for Summing Floating-Point Numbers". 2016. arXiv: 1605.05436

Simon Byrne
  • 7,694
  • 1
  • 26
  • 50
  • Incidentally you can find pseudo-code for Kahan's Summation here: https://en.wikipedia.org/wiki/Kahan_summation_algorithm – Jonathan Mee Jan 19 '17 at 12:17
1

This post was originally tagged C as well, and my code was written for that. I now see the post is C++ only, yet I see little in the following that would not readily apply to C++.

Simplify to finding the sign of the sum of a list of FP numbers

Comparing 2 sets of numbers is equivalent to appending the negation of the second set to the first and then finding the sign of the sum of the joint list. This sign maps to >, == or < for the 2 original sets.
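A sketch of that reduction (in C++ to match the question; sign_of_exact_sum() is a hypothetical name for the routine the rest of this answer develops):

#include <vector>

int sign_of_exact_sum(const std::vector<float>& values); // hypothetical; method below

// Returns -1, 0 or +1 according to whether sum(lhs) <, == or > sum(rhs).
int compare_sets(std::vector<float> lhs, const std::vector<float>& rhs) {
    for (float v : rhs) lhs.push_back(-v); // negation is always exact in IEEE 754
    return sign_of_exact_sum(lhs);
}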

Perform only exact FP math

Assumption: FP employs IEEE-like base-2 numbers, including sub-normals, and is exact for certain operations:

  1. Addition of a + b with the same binary exponent and differing signs.

  2. Subtraction of a same-signed 0.5 from a number in the range 0.5 <= |x| < 1.0.

  3. The frexp*()/ldexp*() functions (splitting a number into significand and exponent parts, and scaling by a power of 2) return exact values; see the snippet below.
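A quick check of assumption 3 (in C++, with arbitrarily chosen values): the split into significand and exponent, and the reassembly, are both exact.

#include <cassert>
#include <cmath>

int main() {
    int e;
    float m = std::frexp(6.0f, &e);   // exact split: m == 0.75f, e == 3
    assert(m == 0.75f && e == 3);
    assert(std::ldexp(m, e) == 6.0f); // exact reassembly
}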

Form array per exponent

Form an array of sums, sums[], whose values will only ever be 0 or satisfy 0.5 <= |sums[i]| < 1.0, one for each possible exponent plus some exponents larger than the maximum. The larger ones are needed to accumulate a |total_sum| that exceeds the type's maximum (FLT_MAX for float). This needs up to log2(SIZE_MAX) extra elements.

Add the set of numbers to sums[]

For each element of the number set, add it to the corresponding sums[] entry per its binary exponent. This is key, as addition of same-signed and differing-signed FP numbers with a common FP binary exponent can be done exactly. The addition may result in a carry with same-signed values and in cancellation with differing-signed values; both are handled. The incoming set of numbers need not be sorted.
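A tiny illustration of that key point (my example values, both with significands in [0.5, 1) at the same binary exponent): the sum is exactly representable whether it carries or cancels.

#include <cassert>

int main() {
    assert(0.75f + 0.75f == 1.5f);     // same sign: carry, result still exact
    assert(0.75f + -0.625f == 0.125f); // differing sign: cancellation, exact
}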

Normalize sums[]

For each element of sums[], ensure any value that is not 0.5, 0.0 or -0.5 is reduced to ±0.5, with the remaining portion added to lower-exponent elements of sums[].

Inspect sums[] for the most significant digit

The sign of the most significant non-zero sums[] element is the sign of the result.


The code below performs the task using float as the set's FP type. Some parallel calculations are done using double as a sanity check, but they do not contribute to the float calculation.

The normalizing step at the end typically iterates twice. Even a worst-case set would, I suspect, iterate about the binary width of the float significand, about 23 times.

The solution appears to be about O(n), but it does use an array about the size of the FP type's exponent range.

#include <assert.h>
#include <stdbool.h>
#include <float.h>
#include <stdio.h>
#include <time.h>
#include <stdint.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>  // memset

#if RAND_MAX/2 >= 0x7FFFFFFFFFFFFFFF
#define LOOP_COUNT 1
#elif RAND_MAX/2 >= 0x7FFFFFFF
#define LOOP_COUNT 2
#elif RAND_MAX/2 >= 0x1FFFFFF
#define LOOP_COUNT 3
#elif RAND_MAX/2 >= 0xFFFF
#define LOOP_COUNT 4
#else
#define LOOP_COUNT 5
#endif

uint64_t rand_uint64(void) {
  uint64_t r = 0;
  for (int i = LOOP_COUNT; i > 0; i--) {
    r = r * (RAND_MAX + (uint64_t) 1u) + ((unsigned) rand());
  }
  return r;
}

typedef float fp1;
typedef double fp2;

fp1 rand_fp1(void) {
  union {
    fp1 f;
    uint64_t u64;
  } u;
  do {
    u.u64 = rand_uint64();
  } while (!isfinite(u.f));
  return u.f;
}

int pre = DBL_DECIMAL_DIG - 1;


void exact_add(fp1 *sums, fp1 x, int expo);

// Add x to sums[expo]
// 0.5 <= |x| < 1
// both same sign.
void exact_fract_add(fp1 *sums, fp1 x, int expo) {
  assert(fabsf(x) >= 0.5 && fabsf(x) < 1.0);
  assert(fabsf(sums[expo]) >= 0.5 && fabsf(sums[expo]) < 1.0);
  assert((sums[expo] > 0.0) == ( x > 0.0));

  fp1 half = x > 0.0 ? 0.5 : -0.5;
  fp1 sum = (sums[expo] - half) + (x - half);
  if (fabsf(sum) >= 0.5) {
    assert(fabsf(sums[expo]) < 1.0);
    sums[expo] = sum;
  } else  {
    sums[expo] = 0.0;
    if (sum) exact_add(sums, sum, expo);
  }
  exact_add(sums, half, expo+1);  // carry
}

// Add  x to sums[expo]
// 0.5 <= |x| < 1
// differing sign
void exact_fract_sub(fp1 *sums, fp1 x, int expo) {
  if(!(fabsf(x) >= 0.5 && fabsf(x) < 1.0)) {
    printf("%d %e\n", __LINE__, x);
    exit(-1);
  }
  assert(fabsf(x) >= 0.5 && fabsf(x) < 1.0);
  assert((sums[expo] > 0.0) != ( x > 0.0));
  fp1 dif = sums[expo] + x;
  sums[expo] = 0.0;
  exact_add(sums, dif, expo);
}

// Add x to sums[]
void exact_add(fp1 *sums, fp1 x, int expo) {
  if (x == 0) return;
  assert (x >= -FLT_MAX && x <= FLT_MAX);
  //while (fabsf(x) >= 1.0) { x /= 2.0; expo++; }
  while (fabsf(x) < 0.5) { x *= (fp1)2.0; expo--; }
  assert(fabsf(x) >= 0.5 && fabsf(x) < 1.0);

  if (sums[expo] == 0.0) {
    sums[expo] = x;
    return;
  }
  if(!(fabsf(sums[expo]) >= 0.5 && fabsf(sums[expo]) < 1.0)) {
    printf("%e\n", sums[expo]);
    printf("%d %e\n", expo, x);
    exit(-1);
  }
  assert(fabsf(sums[expo]) >= 0.5 && fabsf(sums[expo]) < 1.0);
  if ((sums[expo] > 0.0) == (x > 0.0)) {
    exact_fract_add(sums, x, expo);
  } else {
    exact_fract_sub(sums, x, expo);
  }
}

void exact_add_general(fp1 *sums, fp1 x) {
  if (x == 0) return;
  assert (x >= -FLT_MAX && x <= FLT_MAX);
  int expo;
  x = frexpf(x, &expo);
  exact_add(sums, x, expo);
}

void sum_of_sums(const char *s, const fp1 *sums, int expo_min, int expo_max) {
  fp1 sum1 = 0.0;
  fp2 sum2 = 0.0;
  int step = expo_max >= expo_min ? 1 : -1;
  for (int expo = expo_min; expo/step <= expo_max/step; expo += step) {
    sum1 += ldexpf(sums[expo], expo);
    sum2 += ldexp(sums[expo], expo);
  }
  printf("%-20s = %+.*e %+.*e\n", s, pre, sum2, pre, sum1);
}


int test_sum(size_t N) {
  fp1 a[N];
  fp1 sum1 = 0.0;
  fp2 sum2 = 0.0;
  for (size_t i = 0; i < N; i++) {
    a[i] = (fp1) rand_fp1();
    sum1 += a[i];
    sum2 += a[i];
  }
  printf("%-20s = %+.*e %+.*e\n", "initial  sums", pre, sum2, pre, sum1);

  int expo_min;
  int expo_max;
  frexpf(FLT_TRUE_MIN, &expo_min);
  frexpf(FLT_MAX, &expo_max);
  size_t ln2_size = SIZE_MAX;
  while (ln2_size > 0) {
    ln2_size >>= 1;
    expo_max++;
  };
  fp1 sum_memory[expo_max - expo_min + 1];
  memset(sum_memory, 0, sizeof sum_memory);  // set to 0.0 cheat
  fp1 *sums = &sum_memory[-expo_min];

  for (size_t i = 0; i<N; i++)  {
    exact_add_general(sums, a[i]);
  }
  sum_of_sums("post add  sums", sums, expo_min,  expo_max);

  // normalize
  int done;
  do {
    done = 1;
    for (int expo = expo_max; expo >= expo_min; expo--) {
      fp1 x = sums[expo];
      if ((x < -0.5) || (x > 0.5)) {
        //printf("xxx %4d %+.*e ", expo, 2, x);
        done = 0;
        if (x > 0.0) {
          sums[expo] = 0.5;
          exact_add(sums, x - (fp1)0.5, expo);
        } else {
          sums[expo] = -0.5;
          exact_add(sums, x - -(fp1)0.5, expo);
        }
      }
    }
    sum_of_sums("end  sums", sums, expo_min,  expo_max);
  } while (!done);

  for (int expo = expo_max; expo >= expo_min; expo--) {
    if (sums[expo]) {
      return (sums[expo] > 0.0) ? 1 : -1;  // entries are 0.0 or +/-0.5 after normalization
    }
  }
  return 0;
}

#define ITERATIONS 10000
#define MAX_NUMBERS_PER_SET 10000
int main() {
  unsigned seed = (unsigned) time(NULL);
  seed = 0;
  printf("seed = %u\n", seed);
  srand(seed);

  for (unsigned i = 0; i < ITERATIONS; i++) {
    int cmp = test_sum((size_t)rand() % MAX_NUMBERS_PER_SET + 1);
    printf("Compare %d\n\n", cmp);
    if (cmp == 0) break;
  }
  printf("Success");
  return EXIT_SUCCESS;
}

Infinities and NaN can also be handled, to a degree; I leave that for later.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • Thank You for an on-point answer. The `c` tag was removed by someone else, as part of the whole know-it-all-better deal that took place. – szpanczyk Jan 19 '17 at 18:35
  • 1
    It seems this answer is much better than mine, in terms of performance. I'll wait some more to see if other answers emerge. – szpanczyk Jan 19 '17 at 18:37
  • Differences between [this](http://stackoverflow.com/a/41748978/2410359) and [that](http://stackoverflow.com/a/41741276/2410359) are 1) explicit precision requirements of `log2(FLT_MAX) - log2(FLT_TRUE_MIN) + log2(SIZE_MAX)` vs. a TBD `log2(superaccumulator)` 2) efficiency of having the array `sums[]` vs. one very _wide superaccumulator_ and 3) reality vs. conceptual. – chux - Reinstate Monica Jan 21 '17 at 08:47
0

The floating point number resulting from the summation of 2 floating point numbers is only an approximation. Given i1 and i2 to sum, we can find an approximation of the error in the floating point summation by doing this:

i1 + i2 = i12
i12 - i2 = i~1
i1 - i~1 = iΔ

The closest approximation we could come up with for the summation of n numbers would be to calculate the errors of the n - 1 addition operations, then to sum those n - 1 errors, again collecting the resulting n - 2 errors, and so on. You'd repeat this process n - 2 times, or until all the errors have gone to 0.0.
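As a sketch of one pass of that process (my illustration; `values` is a hypothetical input vector, and the error term follows the formula above, which is itself only an approximation):

#include <vector>

// One pass: returns the rounded sum of `values` and fills `errors` with
// the per-addition error estimates (iΔ) to be summed in the next pass.
long double distill_once(const std::vector<long double>& values,
                         std::vector<long double>& errors) {
    long double sum = 0.0L;
    errors.clear();
    for (long double v : values) {
        long double t = sum + v;         // i1 + i2 = i12
        errors.push_back(sum - (t - v)); // iΔ = i1 - (i12 - i2)
        sum = t;
    }
    return sum;
}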

There are a couple of things that could be done to drive the error calculations to 0.0 faster:

  1. Use a larger floating point type, for example long double
  2. Sort the list prior to summing so that you're adding small numbers to small numbers and large numbers to large numbers

Now you can make an assessment of how important accuracy is to you. I will tell you that in the general case, the computational expense of the above operation is outrageous considering that the result you get will still be an approximation.

The generally accepted solution is Kahan summation; it's a happy marriage between speed and precision. Rather than holding the error to the end of the summation, Kahan rolls it into each addition, preventing its value from escalating outside the highest-precision floating point range. Say we're given vector<long double> i1; we could run Kahan summation on it as follows:

#include <iterator> // std::next, std::cbegin, std::cend
#include <numeric>  // std::accumulate
using namespace std;

// Kahan summation (C++14 generic lambda); c carries the running
// compensation, applied once more after the accumulation.
auto c = 0.0L;
const auto sum = accumulate(next(cbegin(i1)), cend(i1), i1.front(), [&](const auto& sum, const auto& input) {
    const auto y = input - c;
    const auto t = sum + y;

    c = t - sum - y;
    return t;
} ) - c;
Jonathan Mee
  • 37,899
  • 23
  • 129
  • 288
  • You're kind of saying that because what I need to do is costly, I should do something else. – szpanczyk Jan 19 '17 at 14:39
  • @szpanczyk More specifically I'm saying that pouring massive computation into finding the exact number but using floating point numbers to evaluate is wasted effort. It's like measuring a cut to the picometer, then making the cut with a chainsaw. Measuring to the millimeter is more than sufficient. Since you can't represent that level of precision with your tool. – Jonathan Mee Jan 19 '17 at 14:55
  • The point where you're mistaken is the order of operations. What if you're making the cut with a chainsaw, and then measuring it to the millimeter? Anyway, the rationale for this question was presented in the revised version; you're free to ignore that. – szpanczyk Jan 19 '17 at 14:58
  • @szpanczyk Clearly I've misunderstood something. I read through the question and thought that I was answering it? Perhaps you can elaborate on your order of operations statement? – Jonathan Mee Jan 19 '17 at 15:01
  • "While for some the idea that it may be useful exercise seems controversial at best, consider the following scenario: comparison between those value sets could be a cornerstone of other operations performed on entire datasets independently in various environments. Synchronized, flawless operation of some systems may depend on this comparison being well defined and deterministically implemented, irregardless of addends order and particular architecture implementing IEEE754 or not. This, or just curiosity." – szpanczyk Jan 19 '17 at 15:12
  • Perhaps we're in a setting where we take responsibility for results of this comparison, and we're not supposed to make any assumptions about acceptable approximation. We're given data in this form from the outside and we're doing whatever we're doing, but have to treat every value as equal exactly what it represents in IEEE754. As was earlier stated - values are exactly what they are, it's the operations that are approximate. I don't have a problem with operations yielding 1000.0 when the correct answer would be 0.001. All I care about is the sign. It has to be perfectly correct. – szpanczyk Jan 19 '17 at 15:12
  • @szpanczyk The result of the total comparison will always be inexact, there's no escaping it while using floating point numbers. Whether you're summing both sets and comparing results or finding the difference between each pair from the sets then summing those differences, you must perform operations which will give you inexact results, and this inexactness may well cause a change in the final sign as well. Summing both sets with Kahan's and comparing the results is about as good as you can do. – Jonathan Mee Jan 19 '17 at 15:27
  • Compare to my answer. You're assuming constraints that just aren't there in the question. – szpanczyk Jan 19 '17 at 15:28
  • @szpanczyk Fixed point is an option. It will give you exact results. I'd encourage you to attempt implementation though. You won't be able to use a `bitset` as you've suggested. `vector` will work. If you can demonstrate functionality rather than just hand waving I'd be thrilled to offer your answer a +1. – Jonathan Mee Jan 19 '17 at 15:41
  • I am only interested in alternatives to my concept, not in points. If my answer was the best answer to my own question, that would leave me dissatisfied. – szpanczyk Jan 19 '17 at 16:13
  • 1
    @szpanczyk If you're saying input floating point numbers are perfectly accurate already and you must preserve perfect accuracy, the only way to do that is fixed point math, exactly as you have suggested. If you're interested you will probably find Dr. Dobb's article on fixed point helpful: http://www.drdobbs.com/cpp/fixed-point-arithmetic-types-for-c/184401992?pgno=1 – Jonathan Mee Jan 19 '17 at 16:24
  • I do not need perfect accuracy. I use perfect accuracy to perform comparison to zero with certainty. However since I am comparing to a very specific value, which zero is, perhaps there is room for alternative approaches. – szpanczyk Jan 19 '17 at 16:30
  • @szpanczyk and Jonathan, the key consideration here is that this is not about calculating a single sum, it's about the *difference* between *two* sums. It took me a while yesterday to grasp the distinction. Obviously one approach is to calculate both sums perfectly, but there might be other approaches as well. – Mark Ransom Jan 19 '17 at 17:19
  • If the general direction to take is to calculate the sum, any error is acceptable as long as it can be proven the sign is correct. – szpanczyk Jan 19 '17 at 19:10
  • Respected @MarkRansom, that's an interesting perspective. But I think a generic solution would _have to_ be able to calculate a single perfect sum. If one of the two sets contained only the number `0.0`, then the solution would need to observe the perfect sum of the other set. No? – Drew Dormann Jan 19 '17 at 22:01
  • 1
    @DrewDormann no, not true. If you could prove that the error for a positive sum would always be positive, and the error for a negative sum would always be negative, it would still work with your example. (I'm not claiming such an algorithm exists, just that I don't dismiss it out of hand). – Mark Ransom Jan 19 '17 at 23:03
0

One of the possibilities for performing this comparison with certainty is to create a class for fixed point arithmetic with precision equal to that of the types in use and with no limit on absolute value.

It could be a class implementing the following public methods:

    FixedPoint(double d);
    ~FixedPoint();

    FixedPoint operator+(const FixedPoint& rhs);
    FixedPoint operator-(const FixedPoint& rhs);
    bool isPositive();

(Every supported floating point type needs a separate constructor.)
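For illustration, a sketch of the intended use, where lhs and rhs are the two value sets (hypothetical variables; a complete interface would likely also need an isZero() query to report equality):

    FixedPoint total(0.0);
    for (double v : lhs) total = total + FixedPoint(v); // exact conversion and addition
    for (double v : rhs) total = total - FixedPoint(v); // exact subtraction
    bool lhsGreater = total.isPositive();               // the sign decides the comparison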

Depending upon circumstances, the implementation would require a collection of bools of fixed size (decided upon construction) or of dynamic size; possibly std::bitset, vector<bool>, or a static or dynamic bool array.

For ease of implementation I would suggest using two's complement encoding.

This is an obvious and very costly solution that would hurt performance if this comparison were the core of some system. Hopefully there is a better solution.

szpanczyk
  • 486
  • 4
  • 15