4

I'm trying to find the number of prime numbers below 400 million but even with just 40 million my code is taking 8 secs to run. what am i doing wrong?

what can i do to make it faster?

#include<iostream>
#include<math.h>
#include<vector>
using namespace std;
int main()
{
    vector<bool> k;                         
    vector<long long int> c;                
    for (int i=2;i<40000000;i++)
    {
        k.push_back(true);                  
        c.push_back(i);
    }

    for ( int i=0;i<sqrt(40000000)+1;i++)                            
    {                                                               
        if (k[i]==true)                                              
       {                                                            
           for (int j=i+c[i];j<40000000;j=j+c[i])                  
           {                                                       
               k[j]=false; 
           }
       }
    }
    vector <long long int> arr;
    for ( int i=0;i<40000000-2;i++)
    {
        if (k[i]==true)
        {
            arr.push_back(c[i]);
        }
    }
    cout << arr.size() << endl ;
    return 0;
}
LookAheadAtYourTypes
  • 1,629
  • 2
  • 20
  • 35
Nikhil
  • 71
  • 7
  • Your task is a slow one. – YoTengoUnLCD May 11 '16 at 20:24
  • @YoTengoUnLCD do you mind suggesting me some improvements. – Nikhil May 12 '16 at 15:23
  • 1
    [*Empirical Orders of Growth, **please!***](http://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth) To make it faster switch to segmented sieve, this is supposed to improve the cache locality. You don't need the vector `c` at all, just maintain a counter. Always start from `i=2`. Switch to working on odds only, `for i=3; ...; i+= 2`. Use odds-packed arrays where `p=2i+start_value` [or something](http://ideone.com/fapob). – Will Ness May 18 '16 at 20:49
  • 2
    you can also initialize the `k` array all at once with `s.resize(40000000, true)`, you know its size. -- and, `sqrt(40000000) = 6324.555`, always. simply declaring a new var to hold this value outside the for loop will make certain that it is optimized. -- do look into the Ideone entry I linked above - it runs ~0.23 sec for 40M and ~3.6 sec for 400M ([***which means it is growing at ~n^1.2 empirically***](https://en.wikipedia.org/wiki/Analysis_of_algorithms#Empirical_orders_of_growth)), and it isn't even segmented. Ideone is slow, your machine should be about 3 times faster than that or so. – Will Ness May 19 '16 at 07:49
  • @WillNess rather than just reserve memory all at once, you can also just fill the vector up-front without a need to push_back at all. see my answer for details – johnbakers May 19 '16 at 13:32
  • @WillNess sorry I misread your comment as `reserve` not `resize`. my bad – johnbakers May 19 '16 at 17:02
  • When marking off the multiples of a prime `p`, you can start from `p*p`; all of the smaller multiples of `p` have already been marked off. –  Jun 08 '16 at 06:53
  • 1
    Why `long long int` rather than, say, `int32_t`? Similarly, `i` and `j` should probably be of type `int32_t` or `int32_fast_t`; technically, you're not guaranteed that `int` is capable of holding numbers this large (although it's common that it can). –  Jun 08 '16 at 06:58

3 Answers3

5

I profiled your code as well as a simple tweak, below. The tweak is more than twice as fast:

    auto start = std::chrono::high_resolution_clock::now();

    //original version
    vector<bool> k;                         
    vector<long long int> c;                
    for (int i=2;i<40000000;i++)
      {
        k.push_back(true);                  
        c.push_back(i);
      }

    for ( int i=0;i<sqrt(40000000)+1;i++)                            
      {                                                               
        if (k[i]==true)                                              
          {                                                            
            for (int j=i+c[i];j<40000000;j=j+c[i])                  
              {                                                       
                k[j]=false; 
              }
          }
      }
    vector <long long int> arr;
    for ( int i=0;i<40000000-2;i++)
      {
        if (k[i]==true)
          {
            arr.push_back(c[i]);
          }
      }
    cout << arr.size() << endl ;


    auto end1 = std::chrono::high_resolution_clock::now();
    std::cout << "Elapsed = " << 
      std::chrono::duration_cast<std::chrono::milliseconds>(end1 - start).count() <<
      std::endl;

  }

  {

    auto begin = std::chrono::high_resolution_clock::now();

    //new version

    const long limit{40000000};
    vector<bool> k(limit-1,true);  

    //k[0] is the number 0
    k[0]=false; k[1]=false;

    auto sq = sqrt(limit) + 1;

    //start at the number 2
    for ( int i=2;i<sq;i++)                            
      {                                                               
        if (k[i]==true)                                              
          {                                                            
            for (int j=i+i;j<limit;j+=i)                  
              {                                                       
                k[j]=false; 
              }
          }
      }


    vector <long long int> arr;
    for ( int i=0;i<limit-2;i++)
      {
        if (k[i]==true)
          {
            arr.push_back(i);
          }
      }
    cout << arr.size() << endl ;



    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << "Elapsed = " << 
      std::chrono::duration_cast<std::chrono::milliseconds>(stop - begin).count() <<
      std::endl;

  }

Here is the output (elapsed in milliseconds), in Debug mode:

2433654
Elapsed = 5787
2433654
Elapsed = 2432

Both have same results, second is much faster.

Here is another version using some nice C++ features (requiring less code), and it is about 11% faster than the second version above:

    auto begin = std::chrono::high_resolution_clock::now();

    const long limit{40000000};
    vector<int> k(limit-1,0);

    //fill with sequence of integers
    std::iota(k.begin(),k.end(),0);

    //k[0] is the number 0
    //integers reset to 0 are not prime
    k[0]=0; k[1]=0;

    auto sq = sqrt(limit) + 1;

    //start at the number 2
    for (int i=2;i<sq;i++)                            
      {                                                               
        if (k[i])
          {                                                            
            for (int j=i+i;j<limit;j+=i)                  
              {                                                       
                k[j]=0; 
              }
          }
      }

    auto results = std::remove(k.begin(),k.end(),0);

    cout << results - k.begin() << endl ;


    auto stop = std::chrono::high_resolution_clock::now();
    std::cout << "Elapsed = " << 
      std::chrono::duration_cast<std::chrono::milliseconds>(stop - begin).count() <<
      std::endl;

  }

Note that in your original version, you push_back in three different places, while this use of modern idioms never uses push_back at all when operating on the vectors.

In this example, the vector is of ints so that you have the actual list of prime numbers when you are finished.

Output:

2433654
Elapsed = 2160

These above are all Debug mode numbers.

In Release mode, the best is a combination of the second and third techniques above, using the numeric with a vector of bools, if you don't care what the actual prime numbers are in the end:

2433654
Elapsed = 1098
2433654
Elapsed bool remove= 410
2433654
Elapsed = 779

Note that your original code only takes about 1 second on my 5 year-old laptop in Release mode, so you are probably running in Debug mode.

johnbakers
  • 24,158
  • 24
  • 130
  • 258
  • did you check the [empirical orders of growth](http://stackoverflow.com/users/849891/will-ness?tab=profile) for your last variant (with the iterators and iota)? I can't imagine it being better than the vector code, with that much more memory use. – Will Ness May 19 '16 at 16:43
  • @WillNess the best variant I posted in my last section used vector of bool as I noted. But the vector of int is provided in the event the user wants a condensed list of the actual prime numbers themselves (rather than the count only) – johnbakers May 19 '16 at 17:01
  • @johnbakers thank you for the answer can you tell me how to change to realease mode – Nikhil May 19 '16 at 20:53
4

I got it down from taking 10 seconds to run to just half a second on my computer by changing two things. First, I'm guessing you didn't compile it with optimization enabled. That brought it from 10 seconds down to 1 second for me. Second, the vector c is unnecessary. Everywhere you have c[i] in your code you can replace it with i+2. This will make it run twice as fast.

Aaron
  • 2,354
  • 1
  • 17
  • 25
  • I'm using code blocks 13.12 on a intel i5-5200,8GB DDR3 machine. I've checked the optimize fully option in compiler settings but still it takes the same amount. what are the other ways to optimize the speed – Nikhil May 12 '16 at 15:21
  • 2
    Remove the c vector. That will make it twice as fast. – Aaron May 12 '16 at 16:01
  • @Nikhil also make sure you are not in Debug mode. That caused mine to take 5x longer than Release mode (see my answer). – johnbakers May 19 '16 at 10:08
4
  1. Remove vector c, you don't need it.
  2. Create vector k with known size at start. Repeatedly appending elements to a vector by invoking push_back() is a really bad idea from a performance point of view, as it can cause repeated memory reallocations and copies.
  3. http://primesieve.org/segmented_sieve.html - segmented version for inspiration.
  4. You can skip processing multiples of 2 and 3. link from code review
  5. It looks that you've got some issue in compiler optimization flag settings. Maybe you didn't change configuration from debug to release. What is your release speedup vs debug one?
Community
  • 1
  • 1
LookAheadAtYourTypes
  • 1,629
  • 2
  • 20
  • 35