Sieve of Eratosthenes run slower after improving

Question

I try to improve basic Sieve of Eratosthenes algorithm by avoiding cross out duplicate multiple of primes, but it turn out to be worse than my expectation

I has implemented two method that return primes in range [2..max)

Basic sieve

public static List<int> Sieve22Max_Basic(int n) {
    var primes = new List<int>();
    var sieve = new BitArray(n, true); // default all number are prime
    //int crossTotal = 0;

    int sqrt_n = (int)Math.Sqrt(n) + 1;
    for (int p = 2; p < sqrt_n; ++p) {
        if (sieve[p]) {
            primes.Add(p);
            //var cross = new List<int>();
            int inc = p == 2 ? p : 2 * p;
            for (int mul = p * p; mul < n; mul += inc) {
                // cross out multiple of prime p
                // cross.Add(mul);
                //++crossTotal;
                sieve[mul] = false;
            }

            //if (cross.Count > 0)
            //    Console.WriteLine($"Prime {p}, cross out: {string.Join(' ', cross)}");
        }
    }
    //Console.WriteLine($"crossTotal: {crossTotal:n0}");

    for (int p = sqrt_n; p < n; ++p)
        if (sieve[p])
            primes.Add(p);

    return primes;
}

Run Sieve22Max_Basic(100), see some multiples are cross more than one (ex: 45, 75, 63)

Prime 2, cross out: 4 6 8 ... 96 98
Prime 3, cross out: 9 15 21 27 33 39 45 51 57 63 69 75 81 87 93 99
Prime 5, cross out: 25 35 45 55 65 75 85 95
Prime 7, cross out: 49 63 77 91

Enhance Sieve

Then, I try to improve by using array that store smallest prime divisor (spd) of each number.

45 = 3 x 5      // spd[45] = 3
75 = 3 x 5 x 5  // spd[75] = 3
63 = 3 x 3 x 7  // spd[63] = 3

When go through multiple of prime p, I will not cross out number mul that have spd[mul] < p because mul was cross out by spd[mul] before

public static List<int> Sieve22Max_Enh(int n) {
    var sieve = new BitArray(n, true);
    var spd = new int[n];
    for (int i = 0; i < n; ++i) spd[i] = i;
    
    var primes = new List<int>();
    //int crossTotal = 0;

    int sqrt_n = (int)Math.Sqrt(n) + 1;
    for (int p = 2; p < sqrt_n; ++p) {
        if (sieve[p]) {
            primes.Add(p);

            //var cross = new List<int>();
            int inc = p == 2 ? 1 : 2;
            for (long mul = p; mul * p < n; mul += inc) {
                if (spd[mul] >= p) {
                    sieve[(int)(mul * p)] = false;
                    spd[mul * p] = p;
                    //++crossTotal;
                    //cross.Add((int)(mul * p));
                }
            }
            //if (cross.Count > 0)
            //    Console.WriteLine($"Prime {p}, cross out: {string.Join(' ', cross)}");
        }
    }
    //Console.WriteLine($"crossTotal: {crossTotal:n0}");

    for (int p = sqrt_n; p < n; ++p)
        if (sieve[p])
            primes.Add(p);

    return primes;
}

Test

I test on my laptop (core i7 - 2.6 Ghz), with n = 1 billion

Sieve22Max_Basic take only 6s while Sieve22Max_Enh take more than 10s to complete

var timer = new Stopwatch();
int n = 1_000_000_000;

timer.Restart();
Console.WriteLine("==== Sieve22Max_Basic ===");
var list = Sieve22Max_Basic(n);
Console.WriteLine($"Count: {list.Count:n0}, Last: {list[list.Count - 1]:n0}, elapsed: {timer.Elapsed}");

Console.WriteLine();

timer.Restart();
Console.WriteLine("==== Sieve22Max_Enh ===");
list = Sieve22Max_Enh(n);
Console.WriteLine($"Count: {list.Count:n0}, Last: {list[list.Count - 1]:n0}, elapsed: {timer.Elapsed}");

You can try at https://onlinegdb.com/tWfMuDDK0

What does it make slower?

`sieve[mul] = false` is a really cheap operation. You will not improve it with your more complex algorithm. — Klaus Gütter, Feb 17 '22 at 05:43
So the new code replaces `sieve[mul] = false;` with `if (spd[mul] >= p) {sieve[(int)(mul * p)] = false; spd[mul * p] = p;}`? You’ve saved an occasional overwrite of `false` with `false` by adding a conditional (based on a different array’s values and which is unpredictable) with a cast, multiplication, and a write to another far away part of the different array. Besides the extra work, this has poor cache locality and probably isn’t benefitting much from branch prediction. — kcsquared, Feb 17 '22 at 05:43
low level optimisations for CPU's are often counter intuitive. CPU's like predictable code with less branches. They like to decide which code to run next, before the results of calculations or memory values are available. If your hot loop has a branch in it based on a memory value, the CPU is going to waste a lot of time executing the wrong branch. — Jeremy Lakeman, Feb 17 '22 at 05:51
Basically, it's just faster to cross-out without checking than it is to check if it's already crossed out and branch if it is. Unpredictable branching tends be be much slower than unnecessary writes. — RBarryYoung, Feb 17 '22 at 16:08

bazzilic · Accepted Answer · 2022-02-17T06:21:33.160

Compare your two loops from the original and improved versions.

Original:

int inc = p == 2 ? p : 2 * p;
for (int mul = p * p; mul < n; mul += inc) {
  sieve[mul] = false;
}

Improved:

int inc = p == 2 ? 1 : 2;
for (long mul = p; mul * p < n; mul += inc) {
  if (spd[mul] >= p) {
    sieve[(int)(mul * p)] = false;
    spd[mul * p] = p;
  }
}

Some observations:

Both loops run the same amount of iterations.
For every iteration, the original loop executes three very fast operations: 1) change a value in BitArray, mul += inc and check mul < n.
For every iteration of the improved loop, we execute more operations: check spd[mul] >= p, mul += inc, mul * p (in the for-loop condition), check mul * p < n.
Increment += and loop condition check for < are the same in both loops; check spd[mul] >= p and change a value in BitArray are comparable in how long they take; but the additional op mul * p in the second loop's condition is multiplication – it's expensive!
But also, for every iteration of the second loop, if spd[mul] >= p is true, then we also execute: mul * p (again!), cast to int, change a value in BitArray, mul * p(third time!), I assume again a cast to int in the indexing of spd, and assign a value in the array spd.

To summarize, your second improved loop's every iteration is computationally "heavier". This is the reason why your improved version is slower.

Good answer, the only problem is the iteration count part— the number of iterations is exactly the same in both loops. — kcsquared, Feb 17 '22 at 06:21
@kcsquared yes, you are probably right. I thought they are the same, then it looked like they aren't, now I think they are the same again after your comment. I'll fix the answer. — bazzilic, Feb 17 '22 at 06:24

Sieve of Eratosthenes run slower after improving

Basic sieve

Enhance Sieve

Test

1 Answers1