I'm using OpenMP for a kNN project. The two parallelized for loops are:
#pragma omp parallel for
for (int ii = 0; ii < sizeTest; ii++) {
    for (int t = 0; t < sizeTraining; t++) {
        distance[ii][t] = EuclideanDistance(testSet[ii], trainingSet[t]);
    }
}
and
#pragma omp parallel for
for (int ii = 0; ii < sizeTest; ii++) {
    classifyPointFromDistance(trainingSet, &testSet[ii], distance[ii], k, sizeTraining);
}
I tried different combinations of scheduling, and these are the results:
Serial: 1020 sec
Static (default chunksize) - 4 Threads = 256,28 sec
Dynamic (default chunksize = 1) - 4 Threads = 256,27 sec
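For reference, a minimal sketch of how the two schedules compared above can be requested on the first loop. This is not the exact project code: `fill_distances`, `squared_distance`, the flat `float` arrays and the `dim` parameter are made up for illustration (squared distance stands in for `EuclideanDistance`), and only the `schedule(...)` clause matters here.

```c
#include <stddef.h>

/* Hypothetical stand-in for EuclideanDistance(): squared distance between
   two points stored as flat arrays of `dim` floats. */
static double squared_distance(const float *a, const float *b, int dim)
{
    double sum = 0.0;
    for (int d = 0; d < dim; d++) {
        double diff = (double)a[d] - (double)b[d];
        sum += diff * diff;
    }
    return sum;
}

/* Fills the row-major sizeTest x sizeTraining matrix `distance`.
   schedule(runtime) defers the choice to the OMP_SCHEDULE environment
   variable; writing schedule(static) or schedule(dynamic, 1) here instead
   hard-codes it. */
void fill_distances(const float *testSet, int sizeTest,
                    const float *trainingSet, int sizeTraining,
                    int dim, double *distance)
{
    #pragma omp parallel for schedule(runtime)
    for (int ii = 0; ii < sizeTest; ii++) {
        for (int t = 0; t < sizeTraining; t++) {
            distance[(size_t)ii * sizeTraining + t] =
                squared_distance(&testSet[(size_t)ii * dim],
                                 &trainingSet[(size_t)t * dim], dim);
        }
    }
}
```

Built with `gcc -fopenmp`, the schedule can then be switched per run without recompiling, e.g. `OMP_SCHEDULE="static" ./knn` versus `OMP_SCHEDULE="dynamic,1" ./knn` (`./knn` is a placeholder binary name).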
I expected static to be the best, since the iterations take approximately the same time, while dynamic would introduce too much overhead. That doesn't seem to happen, and I can't understand why. Furthermore, with static scheduling the speedup seems to be linear except in the 16-thread case:
Serial: 1020 sec
2 Threads: 511,35 sec
4 Threads: 256,28 sec
8 Threads: 128,66 sec
16 Threads: 90,98 sec
24 Threads: 61,4 sec
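To make the scaling easier to read, here is a small sketch that turns the timings above into speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by thread count, where 1.0 means perfectly linear). The function names are made up for illustration; the numbers are the ones listed above.

```c
#include <stdio.h>

/* Speedup relative to the serial run: T_serial / T_parallel. */
double speedup(double serial_sec, double parallel_sec)
{
    return serial_sec / parallel_sec;
}

/* Parallel efficiency: speedup per thread (1.0 = perfectly linear scaling). */
double efficiency(double serial_sec, double parallel_sec, int threads)
{
    return speedup(serial_sec, parallel_sec) / threads;
}

/* Prints one row per measured thread count, using the timings from the post. */
void print_scaling_table(void)
{
    const double serial_sec = 1020.0;
    const int threads[] = {2, 4, 8, 16, 24};
    const double seconds[] = {511.35, 256.28, 128.66, 90.98, 61.4};
    const int n = (int)(sizeof threads / sizeof threads[0]);

    printf("threads  time(s)  speedup  efficiency\n");
    for (int i = 0; i < n; i++) {
        printf("%7d  %7.2f  %7.2f  %10.3f\n",
               threads[i], seconds[i],
               speedup(serial_sec, seconds[i]),
               efficiency(serial_sec, seconds[i], threads[i]));
    }
}
```

By this arithmetic the efficiency stays above 0.99 up to 8 threads, then drops to roughly 0.70 at 16 threads and 0.69 at 24 threads.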
Why does the 16-thread case differ so much from the others? I'm running the algorithm on a Google VM with 24 threads and 96 GB of RAM. Thanks to all.