I have a CU file with a single kernel defined in it. The kernel calls a function, which in turn calls one of two others. Altogether the functions come to only about 600 lines; however, some of them contain long mathematical expressions of about 1200 characters each. There are maybe 3 or 4 of these long expressions, 3 or 4 half that size, and then a few comparatively short expressions. I am compiling this into a CUBIN file to be loaded at runtime in another program. The resulting CUBIN file is only around 800 kB.

Compiling this code for the host (in plain C) using gcc completes in less than a second. NVCC ends up taking ~20-30 minutes or more!

My command-line looks something like this:

nvcc -cubin -arch=sm_20 -m64 -ccbin g++ foo.cu -o foo.cubin -Xptxas -O0 -Xcompiler -Wno-unused-variable

What could be causing this? Is it possible to make it faster in any way?

Thomas Antony
  • There could be a lot of inlining of math functions, unrolling of loops, or template expansion, resulting in much larger intermediate code, leading to a slow compilation. How many lines does the intermediate PTX representation have? 800 KB of object code are on the order of 100,000 machine instructions, that is very large code. Can you show your source? Which CUDA version are you using? Recent tool chains incorporate some improvements for faster compilation of large codes. – njuffa Jun 12 '14 at 17:03
  • Without having seen any code, here are some generic pointers: (1) Use `#pragma unroll 1` in front of loops to prevent unrolling (2) Declare your functions with the `__noinline__` attribute to reduce inlining (3) Get rid of unneeded calls to `pow()`. The first two changes may make your code execute more slowly, while the third item may make your code run faster. – njuffa Jun 12 '14 at 23:28
  • How does the compile time change when you use `-Xptxas -O3` (which is the default setting, so you could also just eliminate the currently used flag)? – njuffa Jun 12 '14 at 23:32
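
The three generic pointers in the comments above could look roughly like this in source. This is only a minimal sketch: the names `cube`, `long_term`, `accumulate`, `kernel`, `HD`, and `NOINLINE` are hypothetical and not taken from the question's code, and the polynomial merely stands in for one of the long expressions. The macros fall back to nothing when the file is not compiled by nvcc, so the same functions can also be built for the host, as the question does with gcc. Whether any of this shortens the compile time for the actual code is untested.

```cpp
// Qualifiers only exist under nvcc; for a plain host build they expand to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#define NOINLINE __noinline__
#else
#define HD
#define NOINLINE
#endif

// (3) Replace pow(x, 3.0) with explicit multiplication for a small integer power.
HD inline double cube(double x) { return x * x * x; }

// (2) Ask the compiler not to inline a function that holds a long expression,
//     so its body is not duplicated at every call site.
//     Placeholder expression: the expansion of (x + y)^3.
HD NOINLINE double long_term(double x, double y) {
    return cube(x) + 3.0 * x * x * y + 3.0 * x * y * y + cube(y);
}

// (1) Ask the compiler not to unroll the loop, which keeps the generated code small.
//     A host-only compiler that does not know this pragma ignores it.
HD double accumulate(const double* xs, const double* ys, int n) {
    double sum = 0.0;
#pragma unroll 1
    for (int i = 0; i < n; ++i) {
        sum += long_term(xs[i], ys[i]);
    }
    return sum;
}

#ifdef __CUDACC__
// Hypothetical single kernel: one thread writes the accumulated value.
__global__ void kernel(const double* xs, const double* ys, int n, double* out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *out = accumulate(xs, ys, n);
    }
}
#endif
```

As the comment notes, (1) and (2) trade some runtime speed for less generated code, while (3) may help on both counts.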

0 Answers