How to achieve maximal accuracy in the sum of series calculation?

Question

I wrote a code to calculate the sum of the series 2^(-k), but I don't know how to improve the accuracy of this calculation. This is what I've done so far.

#include <iostream>
#include <math.h>
using namespace std;

int main()
{
    int i, n;
    float sum = 0;

    cout << "Enter the value of n: ";
    cin >> n;

    for (i=1; i<=n; i++)
        sum += 1.0/pow(2,i);

    cout << "Sum: " << sum;


    return 0;
}

Any suggestion and/or help is very appreciated.

For increased precision, you could start by using `double` or `long double` instead of `float`. — Jesper Juhl, Apr 21 '19 at 09:44
You may use a `long double` to achieve better accuracy. Not that the default _accuracy_ for `double` or `float` output to a stream is 6, unless you change that with [`setprecision()`](https://en.cppreference.com/w/cpp/io/manip/setprecision). — πάντα ῥεῖ, Apr 21 '19 at 09:45
`1-std::ldexp(1,-n)`? This particular series is rather uninteresting, there isn't much rounding going on, so pretty much any technique will give the same answer (unless you set rounding towards infinity). — Marc Glisse, Apr 21 '19 at 09:54
I do not have enough rep to upvote all of your comments but thank you very much for your help! — kamacite, Apr 21 '19 at 11:29

score 0 · Accepted Answer · answered Apr 21 '19 at 10:12

To see more precise output, you need to request more precision than the C++ default. One way to do this is:

#include <iomanip>
…
   std::cout << std::setprecision(99);

Next, consider this code:

for (i=1; i<=n; i++)
    sum += 1.0/pow(2,i);

First, recognize that the quality of pow implementations varies. The C and C++ standards are lax about the quality of floating-point operations, and some pow implementations return results for simple cases like pow(10, 3) that differ slightly from the mathematical result. Due to the ways pow is frequently implemented, pow(2, i) might not suffer from this problem, but it should be considered.

Let’s assume pow(2, i) computes the proper result exactly. Let’s also assume that your C++ implementation uses the common IEEE-754 basic 32-bit binary floating-point format for float. If so, there is no error in the sum computed above for n ≤ 24.

This is because each term, 1.0/pow(2, i), is representable as a single bit in the significand (fraction part) of a float, and the float has 24-bit significands, so 24 consecutive bits can be represented with no error. Once you increase the precision used to format output, the sums shown for n ≤ 24 should be exact.

When n = 25, the sum no longer fits in a float. At this point, the mathematical result will be rounded to the nearest value representable in a float, generally using the rule that, if there is a tie between the two nearest representable values, the one with the even low bit will be chosen. This means the result will be 1, exactly. For all n > 24, the result will be 1.

While using the float type, it is not possible to increase the accuracy beyond this. That is because, of all the values that can be represented in the float type, 1 is the value closest to the exact mathematical sum of the series. There simply is no closer representable value, so no computation or alteration of source code can produce any more accurate value.

You can produce more accurate values by using double instead of float. If the IEEE-754 basic 64-bit binary format is used for double, then this will produce exact results for n ≤ 53. For n > 53, the result will again be 1, and the sum could be improved only by using extended-precision arithmetic.

Additionally, note that:

float sum = 0;
for (i=1; i<=n; i++)
    sum += 1.0/pow(2,i);

is mathematically equivalent to:

float sum = 1 - pow(2.f, (float) -n);

score -1 · Answer 2 · answered Apr 21 '19 at 09:28

-1

Hmm, i think pow uses base first, and the exponent, like this, also remember -i:

for (i=1; i<=n; i++)
    sum += pow(2,-i);

You can use double instead of float when declaring sum, to make it even more accurate (double uses more bits than float to represent a decimal number, so it has higher precision)

answered Apr 21 '19 at 09:28

karlPet

44
4

Thank you for your answer! I used the fact that 2^(-k) = 1/(2^k), that is why I wrote the power function differently. I tried to use double but it shows the same number as it would show with float. Edit: yes, now I see where I was wrong with the power function. Thank you once again! – kamacite Apr 21 '19 at 09:31
I see, 1.0/pow(2, i) works as well. Did you switch place of parameters in pow? – karlPet Apr 21 '19 at 09:31
Yes, I was mistaken, sorry about that. – kamacite Apr 21 '19 at 09:34
Then i believe that your calculations should be correct :) – karlPet Apr 21 '19 at 09:39
Actually, now I see that as I increase n, the series converges to 1 and the value does not change. So maybe the accuracy can be increased by increasing the value of n? – kamacite Apr 21 '19 at 09:41
Yes indeed, but more iterations means longer runtime, so you need to balance between accuracy/computation complexity – karlPet Apr 21 '19 at 09:47

How to achieve maximal accuracy in the sum of series calculation?

2 Answers2