Accuracy of Rosenbrock's test function calculation

Question

I want to calculate Rosenbrock's test function

enter image description here

I have implemented the following C/C++ code

#include <stdio.h>

/********/
/* MAIN */
/********/
int main()
{
    const int N = 900000;

    float *x = (float *)malloc(N * sizeof(float));
    for (int i=0; i<N; i++) x[i] = 3.f;

    float sum_host = 0.f;
    for (int i=0; i<N-1; i++) {
        float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f));
        sum_host = sum_host + temp;
        printf("%i %f %f\n", i, temp, sum_host);
    }
    printf("Result for Rosenbrock's test function calculation = %f\n", sum_host);

}

Since the x array is initialized to 3.f, then each summation term should be 3604.f, so that the final summation involving 899999 terms should be 3243596396. However, the result I get is 3229239296, with an absolute error of 14357100. If I measure the difference between two consecutive partial summations, I see that it is 3600.f for the early partial summations and then it drops to 3584 for the last ones, while it should always be 3604.f.

If I use Kahan summation algorithm as

sum_host = 0.f;
float c        = 0.f;
for (int i=0; i<N-1; i++) {
    float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f)) - c;
    float t    = sum_host + temp;
    c          = (t - sum_host) - temp;
    sum_host = t;
}

the result I get is 3243596288, with a much smaller absolute error of 108.

I'm pretty sure that this effect I see should be ascribed to the precision of floating point arithmetics. Could someone confirm this and provide me an explanation of the mechanism according to which this occurs?

It starts at 3604 for me, rather than 3600. Did you make a typo? — tmyklebu, Jan 07 '15 at 22:29
@tmyklebu That value (`3600`) does not refer to the very first finite difference between consecutive partial summations, but to (about) the `26000`th difference. — Vitality, Jan 07 '15 at 22:34
@JackOLantern The solution to this problem, in case you missed it in Chux's answer is to replace every instance of the word `float` with the word `double`, and fix the constants as well, e.g. `3.f` becomes `3.0`. The calculation that you're doing won't exceed the precision of a `double`, and hence the results of the calculation will have 0 error. — user3386109, Jan 07 '15 at 23:01
@user3386109 Agree with the suggestion about changing `float` to `double` except that the array `*x` may remain `float` for space considerations. Simple way to make `x[i] * x[i]` work at `double` is to do `1.0 * x[i] * x[i]` or `(double) x[i] * x[i]`. — chux - Reinstate Monica, Jan 08 '15 at 00:08
@chux To quote some anonymous wise guy, "Premature optimization is the root of all evil." Which is to say that you shouldn't optimize for space unless space is actually a problem. Personally, I've never worked with a computer that could store 900000 floats, but couldn't just as easily store 900000 doubles. (In the good old days, I did work with PCs that couldn't store 900000 doubles, but they couldn't store 900000 floats either.) — user3386109, Jan 08 '15 at 00:25

tmyklebu · Accepted Answer · 2015-01-07T22:55:10.257

You compute temp = 3604.0f accurately at each iteration. The problem arises when you try adding 3604.0f to something else and round the result to the nearest float. floats store an exponent and a 23-bit significand, meaning any result with 1-bits more than 24 places apart is going to get rounded to something other than what it is.

Note that 3604 = 901 * 4 and the binary expansion of 901 is 1110000101; you'll start seeing roundoff once you start adding temp to something bigger than 2^24 * 4 = 67108864. (This happens when you run the code, too; it starts printing out 3600 as the difference between consecutive sum_host's right when sum_host exceeds 67108864.) You start seeing even more roundoff when you're adding temp to something bigger than 2^26 * 4; at that point, the second smallest '1' bit is getting swallowed as well.

Note that, after you do Kahan summation, sum_host is what you report AND c is -108. This is loosely because c is keeping track of the next most significant 24 bits.

BTW: [binary32](http://en.wikipedia.org/wiki/Single-precision_floating-point_format) has 24-bit binary significand (23 bits stored explicitly)) — chux - Reinstate Monica, Jan 07 '15 at 22:53

score 2 · Answer 2 · answered Jan 07 '15 at 22:37

2

Typical float is only good to maybe 7 digits of precision. Repeatedly adding 3604 to a number 100000x larger than it does not well accumulate the lesser significant digits.

Use double.

answered Jan 07 '15 at 22:37

chux - Reinstate Monica

143,097
13
135
256

I downvoted because I don't think your answer addresses the question to any depth. Downvotes are reversible; if you improve your answer, I'm happy to turn the downvote into an upvote. – tmyklebu Jan 07 '15 at 22:42
@tmyklebu I posted, you posted 10 seconds later, you down voted 3 seconds later. Hmmm. This answer is terse and correct. It goes into enough depth to explain the issue and offers a solution. – chux - Reinstate Monica Jan 07 '15 at 23:13

Accuracy of Rosenbrock's test function calculation

2 Answers2