3

I have a program that requires me to find primes up till 10**10-1 (10,000,000,000). I wrote a Sieve of Eratosthenes to do this, and it worked very well (and accurately) as high as 10**9 (1,000,000,000). I confirmed its accuracy by having it count the number of primes it found, and it matched the value of 50,847,534 on the chart found here. I used unsigned int as the storage type and it successfully found all the primes in approximately 30 seconds.

However, 10**10 requires that I use a larger storage type: long long int. Once I switched to this, the program is running signifigantly slower (its been 3 hours plus and its still working). Here is the relevant code:

typedef unsigned long long ul_long;
typedef unsigned int u_int;

ul_long max = 10000000000;                            
u_int blocks = 1250000000;
char memField[1250000000];     

char mapBit(char place) {             //convert 0->0x80, 1->0x40, 2->0x20, and so on
    return 0x80 >> (place);
}

for (u_int i = 2; i*i < max; i++) {

    if (memField[i / 8] & activeBit) {               //Use correct memory block
        for (ul_long n = 2 * i; n < max; n += i) {
            char secondaryBit = mapBit(n % 8);       //Determine bit position of n
            u_int activeByte = n / 8;                //Determine correct memory block
            if (n < 8) {                             //Manual override memory block and bit for first block
                secondaryBit = mapBit(n);
                activeByte = 0;
            }
            memField[activeByte] &= ~(secondaryBit);  //Set the flag to false
        }
    }
    activeBit = activeBit >> 1;                       //Check the next
    if (activeBit == 0x00) activeBit = 0x80;
} 

I figure that since 10**10 is 10x larger then 10**9 it should take 10 times the amount of time. Where is the flaw in this? Why did changing to long long cause such significant performance issues and how can I fix this? I recognize that the numbers get larger, so it should be somewhat slower, but only towards the end. Is there something I'm missing.

Note: I realize long int should technically be large enough but my limits.h says it isn't even though I'm compiling 64 bit. Thats why I use long long int in case anyone was wondering. Also, keep in mind, I have no computer science training, just a hobbyist.

edit: just ran it in "Release" as x86-64 with some of the debug statements suggested. I got the following output:

enter image description here

looks like I hit the u_int bound. I don't know why i is getting that large.

Ashwin Gupta
  • 2,159
  • 9
  • 30
  • 60
  • Your function is more likely to be of O(n*n) or worse. So if you go from 10 to the power of 9 to 10 to the power of 10, it has an effect on duration 10to the power of (2*10)/10 to the power of (9*2), i.e. 100, not 10. Or worse. – Yunnosch Aug 30 '17 at 06:18
  • Use `` to avoid the weirdness of platform-dependent integer sizes. – o11c Aug 30 '17 at 06:22
  • @o11c ah thanks I'll be sure to keep that in mind. I had some weirdness alright ;) – Ashwin Gupta Aug 30 '17 at 06:23
  • @Yunnosch that makes sense. Still, its been 3.5 hours. My machine is an i7 at 3.4 ghz and 12 GB DDR3. I'd say my computer is reasonably powerful and the number isn't that high. Is my implementation of the sieve flawed? Or is it something else? – Ashwin Gupta Aug 30 '17 at 06:24
  • What compiler, what optimisation flags? – rustyx Aug 30 '17 at 06:28
  • @AshwinGupta you can add - `if(i%1000000 == 0) printf("Completed till i = %lld\n", i);` for regular monitoring. – Ajay Brahmakshatriya Aug 30 '17 at 06:30
  • @AshwinGupta I haven't read your code or anything, but your cpu specs don't really factor into this. It's likely to do with the speed of your code degrading rather fast as the number of primes you want to search for increases. Computers, no matter how powerful they are, can't do everything. Cryptography is an example. – Millie Smith Aug 30 '17 at 06:30
  • @AjayBrahmakshatriya yeah. Im using Visual Studio so I just pause every once in a while and take a look. Thanks though, this might be better. – Ashwin Gupta Aug 30 '17 at 06:31
  • 2
    You can do some optimisations, of course. Make a sieve for odd primes only. And start the inner loop with n = i*i because all smaller multiples of n have already been removed. Make sure that mapBit is inlined. And compile for a 64 bit system, not 32 bits. That will make a huge difference. – gnasher729 Aug 30 '17 at 06:31
  • @AshwinGupta does that mean you are compiling in debug? That is bound to be very slow. Please compile with maximum optimizations. – Ajay Brahmakshatriya Aug 30 '17 at 06:31
  • @AjayBrahmakshatriya I didn't know this, I will compile as "Release" with x86-64 (as I've been doing). – Ashwin Gupta Aug 30 '17 at 06:32
  • @AshwinGupta yes, Release will be the suitable configuration. – Ajay Brahmakshatriya Aug 30 '17 at 06:33
  • @AjayBrahmakshatriya I can't imagine that 10^9 going to 10^10 suddenly becomes really slow because it's a debug build. That would surprise me. – Millie Smith Aug 30 '17 at 06:33
  • @MillieSmith good to know about the specs. I just wanted to put that out there in case it was relevant. I don't know much about this so I just try to make sure I can help people help me ;) – Ashwin Gupta Aug 30 '17 at 06:35
  • @MillieSmith Technically Sieve is `O(n*n)` so it is going to be 100 times slower. But still it shouldn't take so much time. Release would make 10**9 also faster. – Ajay Brahmakshatriya Aug 30 '17 at 06:35
  • @AjayBrahmakshatriya Fair enough. And sounds good Ashwin. Understood. – Millie Smith Aug 30 '17 at 06:35
  • @AshwinGupta you can also further optimize your body by using `memField[n/8] |= 1 << (n%8)`. No need to if conditions and function calls etc. – Ajay Brahmakshatriya Aug 30 '17 at 06:38
  • 1
    Maybe I've missed something, but OP doesn't seem to have made the straightforward comparison of the performance of the two codes on the same problem size. Right now all OP can state is that the `long long int` version is taking longer to solve a larger problem, not that it is running more slowly than the other version. – High Performance Mark Aug 30 '17 at 06:42
  • @HighPerformanceMark this is indeed correct. But these optimization tips are helpful nonetheless. – Ashwin Gupta Aug 30 '17 at 06:44
  • Ditto from @gnasher729, you can make it more efficient by ignoring all even values, and stepping by 2 when looking for a prime, and when filling the seive stepping `2 * i`. More importantly fill the seive from `i * i` (not `i + i`). – Weather Vane Aug 30 '17 at 06:44
  • @WeatherVane yes I'm going to implement that tomorrow morning before school and try. Unfortunately I'm gonna have to call it a night because its midnight here. Thanks to all for your help, I will keep testing with everyone's suggestions and updating the question – Ashwin Gupta Aug 30 '17 at 06:45
  • @RustyX Visual Studio 2015 "cl". Default "Release" configuration. – Ashwin Gupta Aug 30 '17 at 06:46
  • 1
    @AjayBrahmakshatriya: This sieve is absolutely _not_ O (n^2). I think it's O (n log log n), practically linear. – gnasher729 Aug 30 '17 at 18:32
  • @gnasher729 Yes, I agree its not `O(n^2)`. I rushed through the calculation. Anyway, the problem is clear. Thanks for pointing out. – Ajay Brahmakshatriya Aug 30 '17 at 18:45

1 Answers1

12

Your program has an infinite loop in for (u_int i = 2; i*i < max; i++). i is an unsigned int so i*i wraps at 32-bit and is always less than max. Make i an ul_long.

Note that you should use simpler bit pattern from 1 to 0x80 for bit 0 to 7.

Here is a complete version:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned long long ul_long;
typedef unsigned int u_int;

#define TESTBIT(a, bit)   (a[(bit) / 8] & (1 << ((bit) & 7)))
#define CLEARBIT(a, bit)  (a[(bit) / 8] &= ~(1 << ((bit) & 7)))

ul_long count_primes(ul_long max) {
    size_t blocks = (max + 7) / 8;
    unsigned char *memField = malloc(blocks);
    if (memField == NULL) {
        printf("cannot allocate memory for %llu bytes\n",
               (unsigned long long)blocks);
        return 0;
    }
    memset(memField, 255, blocks);
    CLEARBIT(memField, 0);  // 0 is not prime
    CLEARBIT(memField, 1);  // 1 is not prime
    // clear bits after max
    for (ul_long i = max + 1; i < blocks * 8ULL; i++) {
        CLEARBIT(memField, i);
    }

    for (ul_long i = 2; i * i < max; i++) {
        if (TESTBIT(memField, i)) {           //Check if i is prime
            for (ul_long n = 2 * i; n < max; n += i) {
                CLEARBIT(memField, n);                   //Reset all multiples of i
            }
        }
    }
    unsigned int bitCount[256];
    for (int i = 0; i < 256; i++) {
        bitCount[i] = (((i >> 0) & 1) + ((i >> 1) & 1) +
                       ((i >> 2) & 1) + ((i >> 3) & 1) +
                       ((i >> 4) & 1) + ((i >> 5) & 1) +
                       ((i >> 6) & 1) + ((i >> 7) & 1));
    }
    ul_long count = 0;
    for (size_t i = 0; i < blocks; i++) {
        count += bitCount[memField[i]];
    }
    printf("count of primes up to %llu: %llu\n", max, count);
    free(memField);
    return count;
}

int main(int argc, char *argv[]) {
    if (argc > 1) {
        for (int i = 1; i < argc; i++) {
            count_primes(strtoull(argv[i], NULL, 0));
        }
    } else {
        count_primes(10000000000);
    }
    return 0;
}

It completes in 10 seconds for 10^9 and 131 seconds for 10^10:

count of primes up to 1000000000: 50847534
count of primes up to 10000000000: 455052511
chqrlie
  • 131,814
  • 10
  • 121
  • 189
  • Something similar may need to happen in the inner loop as well for the `2 * i` expression. – Michael Burr Aug 30 '17 at 07:01
  • @MichaelBurr: good point! making `i` an `ul_long` is the safer solution. – chqrlie Aug 30 '17 at 07:23
  • Hmm... there is something fishy about the algorithm. Try it for a max value of `541` (for which there are `100` primes), but the output reports `103` ? Then try with a max of `522` it now reports `104` primes... But it does work on the even `100`, `1000`, etc. numbers. (Pentium math?) – David C. Rankin Aug 30 '17 at 08:19
  • @DavidC.Rankin that happens because `memField` has more than `max` bits, if `max` is not a multiple of `8` and the leftover bits are set to `1` and counted for the result. – mch Aug 30 '17 at 08:30
  • Adding `for (ul_long i = max + 1; i < blocks * 8; ++i) CLEARBIT(memField, i);` after `memset` solves this problem: https://ideone.com/K1OGX9 – mch Aug 30 '17 at 08:36
  • +1 Thanks for the answer, the i* I thing should've been obvious! The optimization will come in handy. I will test in the afternoon and accept as long as it works! – Ashwin Gupta Aug 30 '17 at 14:02
  • @DavidC.Rankin yeah this is intended to work only for high powers of 10 so I chose that design. I know it's confusing sorry. – Ashwin Gupta Aug 30 '17 at 14:03
  • @AshwinGupta: there was indeed a small error if max was not a multiple of 8. Thank you David for for catching it and mch for fixing it. – chqrlie Aug 30 '17 at 19:18
  • @AshwinGupta: further optimisations include skipping the even bits, halfing the memory requirements and speeding up the sieve by at least a factor of 2, and computing pi(x) in smaller chunks that fit in the L3 cache. – chqrlie Aug 30 '17 at 19:20
  • @mch: thanks for the fix. The test should actually be `i < blocks * 8ULL` to avoid a potential overflow if `sizeof(size_t) < sizeof(ul_long)`. – chqrlie Aug 30 '17 at 19:24
  • @chqrlie thanks, this worked great and thanks to everyone else who helped me also! – Ashwin Gupta Aug 31 '17 at 04:42