Multithreaded Segmented Sieve of Eratosthenes in Java

Question

I am trying to create a fast prime generator in Java. It is (more or less) accepted that the fastest way for this is the segmented sieve of Eratosthenes: https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes. Lots of optimizations can be further implemented to make it faster. As of now, my implementation generates 50847534 primes below 10^9 in about 1.6 seconds, but I am looking to make it faster and at least break the 1 second barrier. To increase the chance of getting good replies, I will include a walkthrough of the algorithm as well as the code.

Still, as a TL;DR, I am looking to include multi-threading into the code

For the purposes of this question, I want to separate between the 'segmented' and the 'traditional' sieves of Eratosthenes. The traditional sieve requires O(n) space and therefore is very limited in range of the input (the limit of it). The segmented sieve however only requires O(n^0.5) space and can operate on much larger limits. (A main speed-up is using a cache-friendly segmentation, taking into account the L1 & L2 cache sizes of the specific computer). Finally, the main difference that concerns my question is that the traditional sieve is sequential, meaning it can only continue once the previous steps are completed. The segmented sieve however, is not. Each segment is independent, and is 'processed' individually against the sieving primes (the primes not larger than n^0.5). This means that theoretically, once I have the sieving primes, I can divide the work between multiple computers, each processing a different segment. The work of eachother is independent of the others. Assuming (wrongly) that each segment requires the same amount of time t to complete, and there are k segments, One computer would require total time of T = k * t, whereas k computers, each working on a different segment would require a total amount of time T = t to complete the entire process. (Practically, this is wrong, but for the sake of simplicity of the example).

This brought me to reading about multithreading - dividing the work to a few threads each processing a smaller amount of work for better usage of CPU. To my understanding, the traditional sieve cannot be multithreaded exactly because it is sequential. Each thread would depend on the previous, rendering the entire idea unfeasible. But a segmented sieve may indeed (I think) be multithreaded.

Instead of jumping straight into my question, I think it is important to introduce my code first, so I am hereby including my current fastest implementation of the segmented sieve. I have worked quite hard on it. It took quite some time, slowly tweaking and adding optimizations to it. The code is not simple. It is rather complex, I would say. I therefore assume the reader is familiar with the concepts I am introducing, such as wheel factorization, prime numbers, segmentation and more. I have included notes to make it easier to follow.

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;

public class primeGen {

    public static long x = (long)Math.pow(10, 9); //limit
    public static int sqrtx;
    public static boolean [] sievingPrimes; //the sieving primes, <= sqrtx

    public static int [] wheels = new int [] {2,3,5,7,11,13,17,19}; // base wheel primes
    public static int [] gaps; //the gaps, according to the wheel. will enable skipping multiples of the wheel primes
    public static int nextp; // the first prime > wheel primes
    public static int l; // the amount of gaps in the wheel

    public static void main(String[] args)
    {
        long startTime = System.currentTimeMillis();

        preCalc();  // creating the sieving primes and calculating the list of gaps

        int segSize = Math.max(sqrtx, 32768*8); //size of each segment
        long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
        int wh = 0; // the will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes

        long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx

        for (long low = 0 ; low < x ; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
        {
            long high = Math.min(x, low + segSize);
            boolean [] segment = new boolean [(int) (high - low + 1)];

            int g = -1;
            for (int i = nextp ; i <= sqrtx ; i += gaps[g])
            { 
                if (sievingPrimes[(i + 1) / 2])
                {
                    long firstMultiple = (long) (low / i * i);
                    if (firstMultiple < low) 
                        firstMultiple += i; 
                    if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                        firstMultiple += i;

                    for (long j = firstMultiple ; j < high ; j += i * 2) 
                        segment[(int) (j - low)] = true; 
                }
                g++;
                //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment** , and therefore this check is redundant. 
                              //should be used with bigger segment sizes or smaller lists of gaps
                    //g = 0;
            }

            while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }
        }

        System.out.println(pi);

        long endTime = System.currentTimeMillis();
        System.out.println("Solution took "+(endTime - startTime) + " ms");
    }

    public static boolean [] simpleSieve (int l)
    {
        long sqrtl = (long)Math.sqrt(l);
        boolean [] primes = new boolean [l/2+2];
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp ; i <= sqrtl ; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i ; j <= l ; j += i * 2)
                    primes[(j + 1) / 2]=false;
            g++;
            if (g == l)
                g=0;
        }
        return primes;
    }

    public static long pisqrtx ()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length-1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp ; i <= sqrtx ; i += gaps[g])
        {
            if(sievingPrimes[( i + 1 ) / 2])
                pi++;
            g++;
            if (g == l)
                g=0;
        }

        return pi;
    }

    public static void preCalc ()
    {
        sqrtx = (int) Math.sqrt(x);

        int prod = 1;
        for (long p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length-1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
        int lim = prod + nextp; // circumference of the wheel

        boolean [] marks = new boolean [lim + 1];
        Arrays.fill(marks, true);

        for (int j = 2 * 2 ;j <= lim ; j += 2)
            marks[j] = false;
        for (int i = 1 ; i < wheels.length ; i++)
        {
            int p = wheels[i];
            for (int j = p * p ; j <= lim ; j += 2 * p)
                marks[j]=false;   // removing all integers that are NOT comprime with the base wheel primes
        }
        ArrayList <Integer> gs = new ArrayList <Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2 ; p < marks.length ; p += 2)
        {
            if (marks[p]) //d is prime. if p is also prime, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int [gs.size()];
        for (int i = 0 ; i < gs.size() ; i++)
            gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;

        sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
    }

}

Currently, it produces 50847534 primes below 10^9 in about 1.6 seconds. This is very impressive, at least by my standards, but I am looking to make it faster, possibly break the 1 second barrier. Even then, I believe it can be made much faster still.

The whole program is based on wheel factorization: https://en.wikipedia.org/wiki/Wheel_factorization. I have noticed I am getting the fastest results using a wheel of all primes up to 19.

public static int [] wheels = new int [] {2,3,5,7,11,13,17,19}; // base wheel primes

This means that the multiples of those primes are skipped, resulting in a much smaller searching range. The gaps between numbers which we need to take are then calculated in the preCalc method. If we make those jumps between the the numbers in the searching range we skip the multiples of the base primes.

public static void preCalc ()
    {
        sqrtx = (int) Math.sqrt(x);

        int prod = 1;
        for (long p : wheels)
            prod *= p; // primorial
        nextp = BigInteger.valueOf(wheels[wheels.length-1]).nextProbablePrime().intValue(); //the first prime that comes after the wheel
        int lim = prod + nextp; // circumference of the wheel

        boolean [] marks = new boolean [lim + 1];
        Arrays.fill(marks, true);

        for (int j = 2 * 2 ;j <= lim ; j += 2)
            marks[j] = false;
        for (int i = 1 ; i < wheels.length ; i++)
        {
            int p = wheels[i];
            for (int j = p * p ; j <= lim ; j += 2 * p)
                marks[j]=false;   // removing all integers that are NOT comprime with the base wheel primes
        }
        ArrayList <Integer> gs = new ArrayList <Integer>(); //list of the gaps between the integers that are coprime with the base wheel primes
        int d = nextp;
        for (int p = d + 2 ; p < marks.length ; p += 2)
        {
            if (marks[p]) //d is prime. if p is also prime, then a gap is identified, and is noted.
            {
                gs.add(p - d);
                d = p;
            }
        }
        gaps = new int [gs.size()];
        for (int i = 0 ; i < gs.size() ; i++)
            gaps[i] = gs.get(i); // Arrays are faster than lists, so moving the list of gaps to an array
        l = gaps.length;

        sievingPrimes = simpleSieve(sqrtx); //initializing the sieving primes
    }

At the end of the preCalc method, the simpleSieve method is called, efficiently sieving all the sieving primes mentioned before, the primes <= sqrtx. This is a simple Eratosthenes sieve, rather than segmented, but it is still based on wheel factorization, perviously computed.

 public static boolean [] simpleSieve (int l)
    {
        long sqrtl = (long)Math.sqrt(l);
        boolean [] primes = new boolean [l/2+2];
        Arrays.fill(primes, true);
        int g = -1;
        for (int i = nextp ; i <= sqrtl ; i += gaps[g])
        {
            if (primes[(i + 1) / 2])
                for (int j = i * i ; j <= l ; j += i * 2)
                    primes[(j + 1) / 2]=false;
            g++;
            if (g == l)
                g=0;
        }
        return primes;
    }

Finally, we reach the heart of the algorithm. We start by enumerating all primes <= sqrtx, with the following call:

 long pi = pisqrtx();`

which used the following method:

public static long pisqrtx ()
    {
        int pi = wheels.length;
        if (x < wheels[wheels.length-1])
        {
            if (x < 2)
                return 0;
            int k = 0;
            while (wheels[k] <= x)
                k++;
            return k;
        }
        int g = -1;
        for (int i = nextp ; i <= sqrtx ; i += gaps[g])
        {
            if(sievingPrimes[( i + 1 ) / 2])
                pi++;
            g++;
            if (g == l)
                g=0;
        }

        return pi;
    }

Then, after initializing the pi variable which keeps track of the enumeration of primes, we perform the mentioned segmentation, starting the enumeration from the first prime > sqrtx:

 int segSize = Math.max(sqrtx, 32768*8); //size of each segment
        long u = nextp; // 'u' is the running index of the program. will continue from one segment to the next
        int wh = 0; // the will be the gap index, indicating by how much we increment 'u' each time, skipping the multiples of the wheel primes

        long pi = pisqrtx(); // the primes count. initialize with the number of primes <= sqrtx

        for (long low = 0 ; low < x ; low += segSize) //the heart of the code. enumerating the primes through segmentation. enumeration will begin at p > sqrtx
        {
            long high = Math.min(x, low + segSize);
            boolean [] segment = new boolean [(int) (high - low + 1)];

            int g = -1;
            for (int i = nextp ; i <= sqrtx ; i += gaps[g])
            { 
                if (sievingPrimes[(i + 1) / 2])
                {
                    long firstMultiple = (long) (low / i * i);
                    if (firstMultiple < low) 
                        firstMultiple += i; 
                    if (firstMultiple % 2 == 0) //start with the first odd multiple of the current prime in the segment
                        firstMultiple += i;

                    for (long j = firstMultiple ; j < high ; j += i * 2) 
                        segment[(int) (j - low)] = true; 
                }
                g++;
                //if (g == l) //due to segment size, the full list of gaps is never used **within just one segment** , and therefore this check is redundant. 
                              //should be used with bigger segment sizes or smaller lists of gaps
                    //g = 0;
            }

            while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }
        }

I have also included it as a note, but will explain as well. Because the segment size is relatively small, we will not go through the entire list of gaps within just one segment, and checking it - is redundant. (Assuming we use a 19-wheel). But in a broader scope overview of the program, we will make use of the entire array of gaps, so the variable u has to follow it and not accidentally surpass it:

 while (u <= high)
            {
                if (!segment[(int) (u - low)])
                    pi++;
                u += gaps[wh];
                wh++;
                if (wh == l)
                    wh = 0;
            }

Using higher limits will eventually render a bigger segment, which might result in a neccessity of checking we don't surpass the gaps list even within the segment. This, or tweaking the wheel primes base might have this effect on the program. Switching to bit-sieving can largely improve the segment limit though.

As an important side-note, I am aware that efficient segmentation is one that takes the L1 & L2 cache-sizes into account. I get the fastest results using a segment size of 32,768 * 8 = 262,144 = 2^18. I am not sure what the cache-size of my computer is, but I do not think it can be that big, as I see most cache sizes <= 32,768. Still, this produces the fastest run time on my computer, so this is why it's the chosen segment size.
As I mentioned, I am still looking to improve this by a lot. I believe, according to my introduction, that multithreading can result in a speed-up factor of 4, using 4 threads (corresponding to 4 cores). The idea is that each thread will still use the idea of the segmented sieve, but work on different portions. Divide the n into 4 equal portions - threads, each in turn performing the segmentation on the n/4 elements it is responsible for, using the above program. My question is how do I do that? Reading about multithreading and examples, unfortunately, did not bring to me any insight on how to implement it in the case above efficiently. It seems to me, as opposed to the logic behind it, that the threads were running sequentially, rather than simultaneously. This is why I excluded it from the code to make it more readable. I will really appreciate a code sample on how to do it in this specific code, but a good explanation and reference will maybe do the trick too.

Additionally, I would like to hear about more ways of speeding-up this program even more, any ideas you have, I would love to hear! Really want to make it very fast and efficient. Thank you!

(1) Does your code use a common mutable data structure? Can you update different segments independently? If cannot, thread synchronization will eat all the performance gains. (2) Once you've figured the way to cut your computation into independent jobs, put them into a queue, and use a thread pool to read from queue and schedule the jobs to a fixed number of threads (normally the number of cores, maybe one less because JVM runs its own housekeeping threads, like GC). (3) The speedup is [never as linear](https://en.wikipedia.org/wiki/Amdahl's_law) though a respectable speedup is possible. — 9000, Jul 25 '19 at 14:22
@9000 1. Yes, each segment may be processed independently. The processing of the segments solely depends on the sieving primes, which are pre-computed and are identical throuought the entire program. The updating of every segment does not influence at all any other. 2. It's exactly what I am asking, how to do that. My computer has 4 cores, hence 4 threads. How do I create those threads in this specific program to run simultaneously? 3. Yes, I understand that. For simplification I assumed it is linear, and that each segment is processed at constant time. Obviously these statements are wrong. — MC From Scratch, Jul 25 '19 at 14:41
just for comparison, my simple non-segmented C++ sieve takes 3.74s [on ideone](https://ideone.com/fapob) to sieve to 10^9. and I think the monolithic sieve can be multi-threaded just as well -- for each prime in sequence, mark with it the core area, and spawn an independent worker to finish up the job. thus you have a pool of workers which can work out of order in parallel with the core worker. just need a work stealing scheduler for them. — Will Ness, Jul 25 '19 at 17:22
@WillNess It makes sense. My fastest non-segmented (simple) variant takes 3.98s for `10^9`. But it uses wheel too. In general C++ is faster when calculations are concerned. As for multithreading the simple sieve, I would be interested to see it in practice, it sounds interesting. In a simple sieve, the choice of the next 'prime' depends on the fact that all multiples of the previous one have been marked off. Thus, initializing a new thread might choose a 'prime' that isn't actually a prime - it just hasn't been marked off yet. — MC From Scratch, Jul 25 '19 at 17:31
no, we'd have one main thread that marks the sieve array up to the sqrt(N). *that* is done sequentially of course. it chooses the next prime, marks the core up to sqrt(N), and *then* spawns the independent worker to mark *the rest*, up to N; and after that the loop continues to the next prime, and so on -- all below the sqrt(N). then the main loop stops and the workers continue in arbitrary order until there are no more workers. the last found core prime would spawn a worker very near to the N itself, so much so that it'd have very little marking to do left for it. — Will Ness, Jul 25 '19 at 17:54
or maybe not even that. it is enough for the next identified prime's marker to only make few steps, until the first hole - gap - *behind* it is detected. which is the *next* prime. having found the next prime P, we can spawn the marking from P^2 immediately, but we have to wait for the previous prime marker to reach the sqrt(N). so yeah, it's a delicate dance below the sqrt(N), but simple above it. — Will Ness, Jul 25 '19 at 18:01

9000 · Answer 1 · 2019-07-25T16:46:34.907

An example like this should help you get started.

An outline of a solution:

Define a data structure ("Task") that encompasses a specific segment; you can put all the immutable shared data into it for extra neatness, too. If you're careful enough, you can pass a common mutable array to all tasks, along with the segment limits, and only update the part of the array within these limits. This is more error-prone, but can simplify the step of joining the results (AFAICT; YMMV).
Define a data structure ("Result") that stores the result of a Task computation. Even if you just update a shared resulting structure, you may need to signal which part of that structure has been updated so far.
Create a Runnable that accepts a Task, runs a computation, and puts the results into a given result queue.
Create a blocking input queue for Tasks, and a queue for Results.
Create a ThreadPoolExecutor with the number of threads close to the number of machine cores.
Submit all your Tasks to the thread pool executor. They will be scheduled to run on the threads from the pool, and will put their results into the output queue, not necessarily in order.
Wait for all the tasks in the thread pool to finish.
Drain the output queue and join the partial results into the final result.

Extra speedup may (or may not) be achieved by joining the results in a separate task that reads the output queue, or even by updating a mutable shared output structure under synchronized, depending on how much work the joining step involves.

Hope this helps.

Of course it helps, thank you. Will go through it and try to use it. — MC From Scratch, Jul 25 '19 at 15:59

score 1 · Answer 2 · answered Jul 25 '19 at 16:59

1

Are you familiar with the work of Tomas Oliveira e Silva? He has a very fast implementation of the Sieve of Eratosthenes.

answered Jul 25 '19 at 16:59

user448810

17,381
4
34
59

Yes, indeed I am familiar with his work. Parts of my program, such as the segmentation and wheel factorization are ideas that are also present in his work. However, I will be honest here and say that a few, crucial concepts in his algorithm are beyond me. Most of which are not mathematical but rather computational. For example, the 'bucket sieve' - what he refers to as the storage of prime multiple in lists. Not sure what he means exactly and how to implement it. I do intend to get there, but not just yet. – MC From Scratch Jul 25 '19 at 17:18
@MCFromScratch the segmented sieve is presented as having the core - and segment - size of O(sqrt(N)). But another take on it is that there is no N, the sieve is indefinite, and we sieve the segments which are (no less in size than the distance) between the consecutive *squares* of primes. When the core primes become big enough, any *fixed sized* segment will start missing some hits from such big primes and complexity will deteriorate (we spend work on each prime, calculating its starting multiple `m` *in* the segment, but it might not be there at all). Bucket sieve saves such `m`s for later. – Will Ness Jul 25 '19 at 18:53
@WillNess That is actually a very good and simple explanation. From what input does this become useful? And how is it implemented? I guess it reuires pre-computation of a table for those multiples, so we do not generate a multiple whilst inside the algorithm, which won't be used in the specific segment. – MC From Scratch Jul 25 '19 at 19:21
My guess is, as each prime's *next multiple* is calculated, the segment where it falls into is identified, and a "bucket" of such multiples is saved for each segment. that way the work for that prime isn't redone for each segment - the multiple already awaits at *its segment's* bucket for the time when that segment's processing will be activated. When a segment is finished, *its* primes' next multiples are calculated and placed each in corresponding segment's *bucket*. this is mostly a guesswork though, on my part. – Will Ness Jul 25 '19 at 19:30
@WillNess I see. It makes sense actually. Keeping track of the multiples saves a lot of time in the long run, because many calculations are spared. I will try to implement something of this sort. – MC From Scratch Jul 25 '19 at 19:58
"saves time" -- only in case some primes become larger than the segment's size. while the size is adequate and there are no misses, no effort is wasted, and no buckets are needed. we could increase the segment size instead, but then at some point it won't fit in the cache anymore. – Will Ness Jul 25 '19 at 20:13
@WillNess Yes, but you still gave me the idea of keeping track of the multiples for each prime regardless. For every prime, I calculated the first multiple of it in the segment and then kept incrementing it untill it was out of the segment range. Once it was, I just dumped it, and calculated the first multiple of it in the next segment all over again. But actually keeping the value once it's out of the segment range in a separate array of the multiples, spares lots of computation time. For `10^9` it's not massive - only about 5%. But for `10^10`, it reduces computation time by 20% ! – MC From Scratch Jul 25 '19 at 20:26
huh, it depends on the language, I guess. In Haskell it was the other way around, for me. Keeping them was slower than recalculating. – Will Ness Jul 25 '19 at 20:28
@WillNess Really? This is weird. What other optimizations do you have in mind? What else does your algorithm include? For example, do you know how to switch to bit-sieving segmentation rather than the straight-forward boolean-segment sieve? – MC From Scratch Jul 25 '19 at 20:51
Pure Haskell doesn't have global mutable variables, so you end up having to maintain them in a list and passing it around. Few `*` and `mod` calls are cheaper (last time I checked -- which is more than a few years back). It was [this](https://wiki.haskell.org/Prime_numbers#Generating_Segments_of_Primes) version, I think. But it has more of an educational flavor. The real, fast code is in the [arithmoi package](https://hackage.haskell.org/package/arithmoi) by Daniel Fischer. – Will Ness Jul 26 '19 at 10:39

score 0 · Answer 3 · answered Aug 22 '19 at 16:45

How interested in speed are you? Would you consider using c++?

$ time ../c_code/segmented_bit_sieve 1000000000
50847534 primes found.

real    0m0.875s
user    0m0.813s
sys     0m0.016s
$ time ../c_code/segmented_bit_isprime 1000000000
50847534 primes found.

real    0m0.816s
user    0m0.797s
sys     0m0.000s

(on my newish laptop with an i5)

The first is from @Kim Walisch using a bit array of odd prime candidates.

https://github.com/kimwalisch/primesieve/wiki/Segmented-sieve-of-Eratosthenes

The second is my tweak to Kim's with IsPrime[] also implemented as bit array, which is slightly less clear to read, although a little faster for big N due to the reduced memory footprint.

I will read your post carefully as I am interested in primes and performance no matter what language is used. I hope this isn't too far off topic or premature. But I noticed I was already beyond your performance goal.

My current version is faster than the one posted here. it runs in about 1.05s for `10^9`. I purposely do not use C++, because I would like to program in Java - it is the main language I use for most purposes and that's why I need the code in this particular language. I know C++ outperforms Java in that sense, but still. Any advice you have for alternating the code so that it runs faster is largely appreciated. — MC From Scratch, Aug 24 '19 at 08:28

Multithreaded Segmented Sieve of Eratosthenes in Java

3 Answers3