I implemented two versions of a pi approximation. I tested them and noticed that one version is much faster, but I don't really understand why. In the first version I created an array whose size is the number of threads, and each thread updates its own index; in the second version I just used a reduction.
First version:
#pragma omp parallel private(x) shared(sum_vector)
{
    int tid = omp_get_thread_num();
    for (int i = tid; i < num_steps; i += threads_number) {
        x = (i + 0.5) * step;
        sum_vector[tid] += 4.0 / (1.0 + x * x);
    }
}
Second version:
#pragma omp parallel reduction(+:sum) private(x)
{
    int nthreads = omp_get_num_threads();
    int id = omp_get_thread_num();
    for (int i = id; i < num_steps; i += nthreads) {
        x = (i + 0.5) * step;
        sum = sum + 4.0 / (1.0 + x * x);
    }
}
The second version is almost twice as fast for 1 million iterations or more.
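For anyone who wants to reproduce this, here is a minimal self-contained sketch of the two versions as functions. The function names `pi_array` and `pi_reduction` are hypothetical, and the parts the snippets above do not show (how `step` is computed, how `sum_vector` is allocated and summed, the final multiplication by `step`) are my assumptions about the usual midpoint-rule setup, not necessarily identical to my original program. The `#ifdef _OPENMP` fallback is only there so the sketch also compiles without `-fopenmp`.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks (assumption: lets the sketch build without -fopenmp). */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Version 1: each thread accumulates into its own slot of a shared array. */
double pi_array(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    int max_threads = omp_get_max_threads();
    /* calloc zero-initializes every per-thread partial sum. */
    double *sum_vector = calloc((size_t)max_threads, sizeof *sum_vector);
    if (sum_vector == NULL)
        return -1.0; /* allocation failed */

    #pragma omp parallel shared(sum_vector)
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        for (long i = tid; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            sum_vector[tid] += 4.0 / (1.0 + x * x);
        }
    }

    /* Combine the per-thread partial sums serially. */
    double sum = 0.0;
    for (int t = 0; t < max_threads; t++)
        sum += sum_vector[t];
    free(sum_vector);
    return sum * step;
}

/* Version 2: each thread accumulates into a private copy of sum,
   and OpenMP combines the copies at the end of the parallel region. */
double pi_reduction(long num_steps)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;

    #pragma omp parallel reduction(+:sum)
    {
        int nthreads = omp_get_num_threads();
        int id = omp_get_thread_num();
        for (long i = id; i < num_steps; i += nthreads) {
            double x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
    }
    return sum * step;
}
```

Calling both functions with the same `num_steps` and timing each call with `omp_get_wtime()` (compiled with `-fopenmp`) should be enough to compare them. Both should return the same approximation up to floating-point rounding.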
I would appreciate any answer! Thank you in advance!