0

This will be a bit of a funky question I assume and if I need to elaborate, please say so.

The situation is as follows: I have about 2 gigs of GPU memory containing my random numbers and I need to use those in many different functions. To prevent passing around the pointers to this memory, from device function to device function (and this many times over), I put the pointers in the gpu constant memory, which is also saving me registers (for me very important). Now I know that functions can be sped up in some cases if they are explained that memory chunks pointed to by it's arguments are non-overlapping, by using the keyword __restrict__.

The question: how can I make sure the compiler knows that the memory chunks in global memory pointed to by the pointers in constant memory are non-overlapping (and maybe also nice to know: not ever changing after the generate randoms kernel call)?

ikku100
  • 809
  • 1
  • 7
  • 16
  • 2
    How is putting the pointers in constant memory saving registers? And you understand that on all currently supported architectures, kernel arguments are passed in constant memory anyway, so using constant memory to "prevent passing around the pointers" by does nothing different than using a conventional argument kernel argument list, except eliminates the possibility of using `__restrict__`. – talonmies Aug 04 '15 at 17:09
  • The arguments are not only to the kernel call, but also to many functions used inside the kernels. So from device function to device function we were passing around these pointers. I assumed inlining the functions would reduce the number of registers but I found out that by reducing the number of arguments I really improved the number. You have however answered my question I think: the answer is that I cannot, right? If so, please make it an answer and I can accept. I've edited my question to be clearer, thanks for your comment and potential answer. – ikku100 Aug 05 '15 at 10:27
  • The compiler will inline expand `__device__` functions by default, so unless you are explicitly forcing the use of the ABI to call device functions, argument lists to device functions are eliminated by the compiler. Which means that standard kernel argument passing with `__restrict__` will do what you want without having to do anything. Otherwise, there is no way I know of to provide compile time heuristics on otherwise anonymous pointers. – talonmies Aug 05 '15 at 10:31
  • I am afraid that quite a few of our functions are not inlined by default (I checked and `__forceinline__` really did make it faster (thanks btw, I already own you one for this). FYI, I also added `__restrict__` to the `__constant__` pointers, but that didn't change anything (but it didn't throw errors either, which I expected). If you want, you can make your first comment about that it's not possible, an answer? Else I will refer to it as an answer, so others won't overlook it. – ikku100 Aug 05 '15 at 14:51

1 Answers1

0

I am not aware of a way to provide the compiler with heuristics on otherwise anonymous pointers.

If you can manage it, the simplest way to try and help the compiler do its job is to pass the pointers as __restrict__ decorated kernel arguments and then force device functions inline. That will bypass the ABI and may allow the compiler to exploit the known non-aliasing condition to optimise memory access patterns. It should also help with the register footprint of your functions a bit. I'm not sure that __restrict__ will have much effect on __device__ functions or __constant__ declarations, but you have noted that the compiler accepts it, so I guess it can't hurt to at least try.

I would look forward to comments from one of NVIDIA's toolchain or optimisation gurus on what might go on under the hood and what other tricks might be useful in this case.

talonmies
  • 70,661
  • 34
  • 192
  • 269