Assume you're given two sets of floating point variables implemented according to IEEE 754, meant to be treated as the exact values given by the standard's encoding formulae. All legal values are possible. The number of variables in a set may be any natural number.
What would be a good way to compare the exact, mathematical sums of the values represented by those variables? Due to the nature of the domain, the problem can easily be restated as comparing a single sum to zero (negating a floating point value is exact, so the second set can simply be negated and appended to the first). You can disregard the possibility of NaNs or infinities, as it is irrelevant to the core problem. (Those values can be checked for easily and independently, and acted upon in a manner suiting the particular application.)
A naive approach would be to simply sum and compare, or to sum the values of one set and subtract the values of the other.
    bool compare(const std::vector<float>& lhs, const std::vector<float>& rhs)
    {
        float lSum = 0.0f;
        for (auto value : lhs)
        {
            lSum += value;
        }
        float rSum = 0.0f;
        for (auto value : rhs)
        {
            rSum += value;
        }
        return lSum < rSum;
    }
Quite obviously there are problems with the naive approach, as mentioned in various other questions regarding floating point arithmetic. Most of the problems stem from two difficulties:
- the result of adding floating point values differs depending on their order,
- certain orders of addition of certain sets of values may result in intermediate overflow (an intermediate result of the calculation goes beyond the range supported by the available data type).
    float small = strtof("0x1.0p-126", NULL);
    float big = strtof("0x1.8p126", NULL);
    std::cout << std::hexfloat << small + big - big << std::endl;
    std::cout << std::hexfloat << (big - 2 * small) + (big - small) + big - (big + small) - (big + 2 * small) << std::endl;
This code will output

    0
    inf

illustrating how ordering affects the result and, hopefully, that the problem of ordering is non-trivial.

    float prev;
    float curr = 0.0f;
    do
    {
        prev = curr;
        curr += strtof("0x1.0p-126", NULL);
    } while (prev != curr);
    std::cout << std::hexfloat << curr << std::endl;
This code, given sufficient time to actually finish computing, would output 0x1.000000p-102, not, as might naively be expected, 0x1.fffffep127. (To observe this within reasonable time, change the initialization of curr to strtof("0x1.fff000p-103", NULL).) It illustrates how the proportion between the intermediate result of the addition and a particular addend affects the result.
A lot has been said about obtaining the best precision, e.g. in this question.
The problem at hand differs in that we do not want to maximize precision; rather, there is a well-defined function that needs to be implemented exactly.
While to some the idea that this may be a useful exercise seems controversial at best, consider the following scenario: the comparison between those value sets could be the cornerstone of other operations performed on entire datasets independently in various environments. Synchronized, flawless operation of some systems may depend on this comparison being well defined and deterministically implemented, regardless of addend order and regardless of whether the particular architecture implements IEEE 754.
This, or just curiosity.
In the discussion, the Kahan summation algorithm has been mentioned as relevant. That algorithm, however, is a (reasonable) attempt at minimizing error. It neither guarantees the correct sign of the result, nor is it independent of the order of operations (which would at least guarantee a consistent, if wrong, result across permutations of the same set).
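For reference, the textbook formulation of compensated (Kahan) summation looks roughly like the sketch below (the name kahanSum is illustrative; note also that aggressive compiler optimizations along the lines of -ffast-math may legally rewrite and defeat the compensation):

    #include <vector>

    // Kahan (compensated) summation: carries the rounding error of each
    // addition forward in a separate compensation term. This shrinks the
    // accumulated error, but the result is still neither exact nor
    // independent of the order of the addends.
    float kahanSum(const std::vector<float>& values)
    {
        float sum = 0.0f;
        float comp = 0.0f;          // lost low-order bits, to be re-applied
        for (float value : values)
        {
            float y = value - comp; // re-apply previously lost bits
            float t = sum + y;      // low-order bits of y may be lost here...
            comp = (t - sum) - y;   // ...and are recovered into comp
            sum = t;
        }
        return sum;
    }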
One of the most obvious solutions would be to employ/implement fixed point arithmetic, using a sufficient number of bits to represent every possible operand value exactly and to keep the intermediate result exact.
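To sketch that idea for float inputs (a sketch only, with names such as FixedAccumulator and compareExact chosen purely for illustration): every finite float is an integer multiple of 2^-149, the smallest positive subnormal, and its magnitude is below 2^128, so a signed fixed-point value of a few hundred bits can represent every addend, and every intermediate sum, exactly. Six 64-bit limbs give 384 bits, i.e. roughly a hundred bits of headroom against intermediate overflow:

    #include <cmath>
    #include <cstdint>
    #include <vector>

    // 384-bit two's-complement fixed-point accumulator in units of 2^-149
    // (the smallest positive subnormal float), little-endian 64-bit limbs.
    // A single float spans at most bit positions 0..277 here, so six limbs
    // leave ~100 bits of headroom before the running sum could overflow.
    struct FixedAccumulator
    {
        static constexpr int kLimbs = 6;
        uint64_t limbs[kLimbs] = {};

        void add(float f) // f assumed finite (NaNs/Infs excluded per the question)
        {
            int e;
            double frac = std::frexp(static_cast<double>(f), &e);      // f == frac * 2^e
            int64_t mant = static_cast<int64_t>(std::ldexp(frac, 24)); // exact, <= 24 bits
            int shift = e + 125; // now f == mant * 2^(shift - 149)
            if (shift < 0) { mant >>= -shift; shift = 0; } // only subnormals; exact
            int limbIdx = shift / 64, bitIdx = shift % 64;
            // The shifted mantissa spans at most two limbs; everything above
            // them is pure sign extension. Arithmetic right shift of a signed
            // value is assumed (guaranteed since C++20, universal in practice).
            uint64_t ext = mant < 0 ? ~0ull : 0ull;
            uint64_t lo = static_cast<uint64_t>(mant) << bitIdx;
            uint64_t hi = bitIdx == 0 ? ext
                                      : static_cast<uint64_t>(mant >> (64 - bitIdx));
            uint64_t carry = 0;
            for (int i = limbIdx; i < kLimbs; ++i)
            {
                uint64_t word = i == limbIdx ? lo : (i == limbIdx + 1 ? hi : ext);
                uint64_t s = limbs[i] + word;
                uint64_t c = s < word ? 1u : 0u;       // carry out of first add
                limbs[i] = s + carry;
                carry = c + (limbs[i] < carry ? 1u : 0u);
            }
        }

        int sign() const // -1, 0 or +1 for the exact accumulated sum
        {
            if (limbs[kLimbs - 1] >> 63) return -1;
            for (int i = 0; i < kLimbs; ++i)
                if (limbs[i]) return 1;
            return 0;
        }
    };

    bool compareExact(const std::vector<float>& lhs, const std::vector<float>& rhs)
    {
        FixedAccumulator acc;
        for (float v : lhs) acc.add(v);
        for (float v : rhs) acc.add(-v); // negating a float is always exact
        return acc.sign() < 0;           // exact equivalent of lSum < rSum
    }

The two's-complement representation makes subtraction free and the final decision a single sign-bit inspection plus a zero check, so both ordering and intermediate overflow cease to matter by construction.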
Perhaps, however, this can be done using only floating point arithmetic, in a manner that guarantees the correct sign of the result. If so, the problem of overflow (as illustrated in one of the examples above) needs to be addressed in the solution, as that is a particular technical aspect of this question.
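The usual building block for that direction is the error-free transformation known as TwoSum (Møller/Knuth), which computes the rounded sum of two values together with the exact rounding error, using nothing but floating point additions and subtractions; Shewchuk-style expansion arithmetic and Rump's accurate-summation algorithms build exact-sign decisions on top of it. A minimal sketch of the primitive itself (function name illustrative; note the identity only holds while no intermediate expression overflows, which is exactly the overflow caveat above):

    #include <iostream>
    #include <utility>

    // TwoSum (Moller/Knuth): for finite a and b, computes s = fl(a + b)
    // together with e such that a + b == s + e holds exactly, using only
    // float operations. Error-free only while a + b does not overflow.
    std::pair<float, float> twoSum(float a, float b)
    {
        float s = a + b;
        float bv = s - a;                    // portion of b absorbed into s
        float e = (a - (s - bv)) + (b - bv); // exact rounding error of a + b
        return std::make_pair(s, e);
    }

    int main()
    {
        // 1.0f + 1e-8f rounds to just 1.0f; twoSum recovers the lost addend.
        std::pair<float, float> r = twoSum(1.0f, 1e-8f);
        std::cout << std::hexfloat << r.first << " + " << r.second << std::endl;
    }

Distilling a whole set through repeated TwoSum passes yields a non-overlapping expansion whose leading component carries the sign of the exact sum; making that robust against the overflow case is the hard part this question asks about.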
(What follows is the original question.)
I have two sets of multiple floating point (float or double) values. I want to provide a perfect answer to the question of which set has the larger sum. Because of artifacts of floating point arithmetic, in some corner cases the result of the naive approach may be wrong, depending on the order of operations; not to mention that a simple sum can overflow. I can't present any effort of my own, because all I have are vague ideas, all of them complicated and unconvincing.