When using an OpenACC "#pragma acc routine worker" routine that contains multiple loops of vector (and worker) level parallelism, how do vector_length and num_workers work?
I played around with some code (see below) and stumbled upon a few things:
- Setting the vector length of these loops is seriously confusing me. Using the vector_length(#) clause on the outer parallel region seems to behave oddly when comparing run times: when I increase the vector length to a huge number, say e.g. 4096, the run time actually gets smaller. In my understanding, a huge number of threads should simply lie dormant when there are only 10 iterations in the vector loop. Am I doing something wrong here?
- I noticed that the output weirdly depends on the number of workers set in foo(). If it is 16 or smaller, the output is "correct". If it is 32 and even much larger, the loops inside the worker routine somehow get executed twice (e.g. 1990000 instead of the expected 1000000, see the outputs in the code comments below). What am I missing here?
Can someone give me a hand with the OpenACC routine clause? Many thanks in advance.
Here is the example code:
#include <iostream>
#include <chrono>

class A{
public:
    int out;
    int* some_array;

    A(){
        some_array = new int[1000*100*10];
        for(int i = 0; i < 1000*100*10; ++i){
            some_array[i] = 1;
        }
        #pragma acc enter data copyin(this, some_array[0:1000*100*10])
    }

    ~A(){
        #pragma acc exit data delete(some_array, this)
        delete [] some_array;
    }

    #pragma acc routine worker
    void some_worker(int i){
        int private_out = 10;
        #pragma acc loop vector reduction(+: private_out)
        for(int j=0; j < 10; ++j){
            //do some stuff
            private_out -= some_array[j];
        }
        #pragma acc loop reduction(+: private_out) worker
        for(int j=0; j < 100; ++j){
            #pragma acc loop reduction(+: private_out) vector
            for(int k=0; k < 10; ++k){
                //do some other stuff
                private_out += some_array[k+j*10+i*10*100];
            }
        }
        #pragma acc atomic update
        out += private_out;
    }

    void foo(){
        #pragma acc data present(this, some_array[0:1000*100*10]) pcreate(out)
        {
            #pragma acc serial
            out=0;
            //#######################################################
            //# setting num_workers and vector_length produce weird #
            //# results and runtimes                                #
            //#######################################################
            #pragma acc parallel loop gang num_workers(64) vector_length(4096)
            for(int i=0; i < 1000; ++i){
                some_worker(i);
            }
            #pragma acc update host(out)
        }
    }
};

int main() {
    using namespace std::chrono;
    A a;
    auto start = high_resolution_clock::now();
    a.foo();
    auto stop = high_resolution_clock::now();
    std::cout << a.out << std::endl
              << "took " << duration_cast<microseconds>(stop - start).count() << "us" << std::endl;
    //output for num_workers(16) vector_length(4096)
    //1000000
    //took 844us
    //
    //output for num_workers(16) vector_length(2)
    //1000000
    //took 1145us
    //
    //output for num_workers(32) vector_length(2)
    //1990000
    //took 1480us
    //
    //output for num_workers(64) vector_length(1)
    //1990000
    //took 502us
    //
    //output for num_workers(64) vector_length(4096)
    //1000000
    //took 853us
    return 0;
}
Machine specs: nvc++ 21.3-0 with OpenACC 2.7, Tesla K20c with cc35, NVIDIA-driver 470.103.01 with CUDA 11.4
Edit:
Additional information for the second point:
I simply used some printfs in the worker to look at the intermediate results; I placed them at the implicit barriers between the loops. I could see that the value of private_out went from initially 10
- to -10 instead of 0 between the loops, and
- to 1990 instead of 1000 after the second loop nest.
This just looks to me like both loops are being executed twice (10 - 2*10 = -10, and -10 + 2*1000 = 1990).
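For reference, the instrumentation looked roughly like this, i.e. some_worker from the example above with two printfs added (a simplified sketch; the if(i == 0) guard and the message text are just illustrative, and #include <cstdio> is needed for printf on the device):

#pragma acc routine worker
void some_worker(int i){
    int private_out = 10;
    #pragma acc loop vector reduction(+: private_out)
    for(int j=0; j < 10; ++j){
        private_out -= some_array[j];
    }
    //implicit barrier after the vector loop:
    //expected private_out == 0 here, but it reads -10
    if(i == 0) printf("after first loop: %d\n", private_out);
    #pragma acc loop reduction(+: private_out) worker
    for(int j=0; j < 100; ++j){
        #pragma acc loop reduction(+: private_out) vector
        for(int k=0; k < 10; ++k){
            private_out += some_array[k+j*10+i*10*100];
        }
    }
    //expected private_out == 1000 here, but it reads 1990
    if(i == 0) printf("after second loop: %d\n", private_out);
    #pragma acc atomic update
    out += private_out;
}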
More results for convenience
To add to the strangeness of this example: the code does not compile for some combinations of num_workers/vector_length, e.g. leaving num_workers at 64 and setting vector_length to 2, 4, 8, 16, or even 32 (the latter pushing the thread count per gang over the limit of 1024). It gives the error message
ptxas error : Entry function '_ZN1A14foo_298_gpu__1Ev' with max regcount of 32 calls function '_ZN1A11some_workerEi' with regcount of 41
However, after simply inserting the printfs described above, it suddenly compiles fine but runs into a runtime error: "call to cuLaunchKernel returned error 1: Invalid value".
Strangest of all, it compiles and runs fine for 64/64 but returns incorrect results. Below is the output of this setting with NV_ACC_TIME=1; note that the output is almost exactly the same for all configurations that compile and run, except for the block: [1x#-######] part.
Accelerator Kernel Timing data
/path/to/src/main.cpp
_ZN1AC1Ev NVIDIA devicenum=0
time(us): 665
265: data region reached 1 time
265: data copyin transfers: 3
device time(us): total=665 max=650 min=4 avg=221
/path/to/src/main.cpp
_ZN1AD1Ev NVIDIA devicenum=0
time(us): 8
269: data region reached 1 time
269: data copyin transfers: 1
device time(us): total=8 max=8 min=8 avg=8
/path/to/src/main.cpp
_ZN1A3fooEv NVIDIA devicenum=0
time(us): 1,243
296: data region reached 2 times
298: compute region reached 2 times
298: kernel launched 2 times
grid: [1-1000] block: [1-32x1-24]
device time(us): total=1,230 max=1,225 min=5 avg=615
elapsed time(us): total=1,556 max=1,242 min=314 avg=778
304: update directive reached 1 time
304: data copyout transfers: 1
device time(us): total=13 max=13 min=13 avg=13