
I'm trying to parallelize my sequential C code and offload it to an NVIDIA GPU with OpenACC (PGI compiler).

My code is written as sequential code, and it frequently calls very long functions, like below.

int main()
{
   // blah blah...
   for(i=0; i<10; i++)
   {
      for(j=0; j<20; j++)
      {
          big_function(a,b,c);
      }
   }
   // blah blah...
}

int big_function(int a, int b, int c)
{
   small_function_1(a);
   small_function_2_with_data_dependencies(b);
   return 0;
}

In that kind of case, can big_function() be parallelized and run on the GPU?

I declared the whole for loop as a parallelized region using #pragma acc kernels, like below.

#pragma acc routine
int big_function(int a, int b, int c);
#pragma acc routine
int small_function_1(int a);
#pragma acc routine
int small_function_2_with_data_dependencies(int b);

int main()
{
   // blah blah...
   #pragma acc data ~~~~
   #pragma acc kernels
   for(i=0; i<10; i++)
   {
      for(j=0; j<20; j++)
      {
          big_function(a,b,c);
      }
   }
   // blah blah...
}

int big_function(int a, int b, int c)
{
   small_function_1(a);
   small_function_2_with_data_dependencies(b);
   return 0;
}

But the compiled program takes a very long time to finish, and the result is not correct.

Can I use OpenACC to parallelize sequential code that makes many function calls?

Or do I have to break big_function() up into smaller parts?

soongk

2 Answers


You will need to decorate each function down your call tree with the acc routine directive, like you've done in your example. If you're expecting all of the parallelism to come from the loops at the top level, then you'll want all of the routines to be marked as sequential (seq). As long as you've done that, the compiler should be able to build it for the GPU.

It's very likely that you'll get poor performance though, as large function call trees like this tend to contain a lot of state, which eats away at the GPU resources, shared memory and registers in particular. You'll probably find that it'd perform a lot better on GPUs if you move the parallelism down the call tree, but that has the potential to negatively affect CPU performance and possibly increase memory usage as you have to save off data that was previously available as thread state.

If you can provide more information about the actual code, I can try to help you debug the correctness problems. You should check the compiler feedback (-Minfo) and make sure the compiler is doing what you think it's doing. You might find out that it's getting tripped up by the call tree. You might also try the PGI forums, as they're often very responsive to help queries there.
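For reference, a typical PGI build line that prints that feedback might look like the following (the file name and GPU target are assumptions):

```shell
# Hypothetical build command: -Minfo=accel makes pgcc report which loops
# it parallelized and what data movement it generated for the GPU.
pgcc -acc -ta=tesla -Minfo=accel main.c -o main
```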

jefflarkin
  • Thank you for your reply. I'm sorry but I can't provide my actual code, because it's my company project. – soongk Aug 10 '15 at 01:09
  • I have one more question. Why does the long function cause poor performance? The same code runs on the CPU very well, with nice performance. Is there some difference between running code on the CPU and the GPU? – soongk Aug 10 '15 at 01:11
  • 1
    On a CPU you have a relatively small number of threads running at a time, so maintaining the state for each of those threads doesn't require a lot of resources. This is because switching threads is expensive. On a GPU you have orders in magnitude more threads at any given time, so the amount of state that has to be maintained grows as well, resulting in using a lot of registers, shared memory, and local memory on the GPU. As these increase, fewer threads can be resident on the GPU at a time, something known as low occupancy, so the GPU won't have as much parallelism to exploit for performance. – jefflarkin Aug 10 '15 at 13:59

It depends on the depth of your call tree. As jefflarkin said, acc routine can help you, but it only goes so far. In general, these routines need to be inlined to create one big kernel. GPUs aren't really built to handle complex kernels with thousands of lines of code; that is, even if it works, it will be hard to make it performant.

The way to do it in a more complex case is to privatize your call graph (which I assume is the physical parametrization of some simulation) in the i,j domains. That is, instead of computing everything for one column or surface point, you pass higher-dimensional data down to your subroutines, so you can parallelize smaller chunks in i,j.

Side note: for Fortran 90+ I've built a tool that does the parallelization for you, but I'm afraid it doesn't support C. Maybe it will inspire a preprocessing solution for you, though. In my case I needed to retain CPU performance, which may suffer with the solution I've proposed above, but this may not apply in your case.

Michel Müller