
Since it's possible to use Clang for CUDA compilation, I am interested in studying how Clang converts CUDA code (a .cu file) to intermediate representation (IR).

CUDA compilation with Clang requires certain CUDA libraries. So is the parsing of the __shared__ keyword in a CUDA program done by Clang or by the CUDA compiler? From my initial searches, I believe the conversion is done by CUDA and not Clang. Is this understanding correct?

talonmies
Ginu Jacob

1 Answer


When Clang compiles CUDA code, NVIDIA's NVCC compiler is not involved.

__shared__, or more accurately __attribute__((shared)), is an attribute Clang knows. If Clang encounters a variable marked with the shared attribute, it does two things:

  1. The variable is given static (internal) linkage, meaning the definition of the variable moves from the kernel function to module scope.
  2. The variable is placed in address space 3, which is defined as the shared memory address space.

Compiling this little program with clang:

__global__ void foo(int* tmp)
{
  __shared__ int vec[32];
  vec[threadIdx.x] = tmp[threadIdx.x];
  tmp[threadIdx.y] = vec[threadIdx.y];
}

int main()
{
  int* tmp;
  foo<<<1, 1>>>(tmp);
  return tmp[0];
}

results in the following IR:

  ; ModuleID = 'sm.cu'
  target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
  target triple = "nvptx64-unknown-unknown"

  @vec = internal unnamed_addr addrspace(3) global [32 x i32] zeroinitializer, align 4

  ; Function Attrs: nounwind readnone
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() #0

  ; Function Attrs: nounwind readnone
  declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() #0

  define ptx_kernel void @__pacxx_kernel0(i32 addrspace(1)* %tmp) {
    %1 = tail call spir_func i32 @llvm.nvvm.read.ptx.sreg.tid.x() #1
    %2 = zext i32 %1 to i64
    %3 = getelementptr i32, i32 addrspace(1)* %tmp, i64 %2
    %4 = load i32, i32 addrspace(1)* %3, align 4
    %5 = getelementptr [32 x i32], [32 x i32] addrspace(3)* @vec, i64 0, i64 %2
    store i32 %4, i32 addrspace(3)* %5, align 4
    %6 = tail call spir_func i32 @llvm.nvvm.read.ptx.sreg.tid.y() #1
    %7 = zext i32 %6 to i64
    %8 = getelementptr [32 x i32], [32 x i32] addrspace(3)* @vec, i64 0, i64 %7
    %9 = load i32, i32 addrspace(3)* %8, align 4
    %10 = getelementptr i32, i32 addrspace(1)* %tmp, i64 %7
    store i32 %9, i32 addrspace(1)* %10, align 4
    ret void
  }

You can see that the variable vec has internal linkage at module scope and resides in address space 3.

Clang follows the NVVM IR specification. However, NVVM IR is specified for LLVM 3.4, so you may encounter problems if you feed IR generated by newer LLVM/Clang versions to tools that expect NVVM IR. The NVPTX backend in LLVM, however, does not have this restriction and can generate PTX code without problems. Newer versions of Clang build a fat binary just like NVCC does; in older versions you have to build the executable on your own and compile the device part of the program with the -fcuda-is-device command line flag.
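As a concrete illustration (my own sketch, not from the answer: the flags assume a reasonably recent Clang with CUDA support, and the toolkit path and SM architecture are examples), the device-side IR for the program above can be dumped like this:

```shell
# Compile only the device half of sm.cu and emit textual LLVM IR.
# --cuda-gpu-arch selects the target SM; --cuda-path points at the CUDA toolkit.
clang++ --cuda-device-only --cuda-gpu-arch=sm_35 -emit-llvm -S \
        --cuda-path=/usr/local/cuda sm.cu -o sm-device.ll

# The __shared__ array should now appear in addrspace(3):
grep 'addrspace(3)' sm-device.ll
```

Without --cuda-device-only, Clang compiles the host side of the translation unit, which is why a plain -emit-llvm -S invocation shows no addrspace-qualified variables.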

The PTX code can then be used to program a GPU by loading it through the CUDA driver API.
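To make that last step concrete, here is a minimal sketch (my own illustration, not from the answer) of loading PTX through the CUDA driver API; the file name sm.ptx and the kernel name __pacxx_kernel0 are taken from the IR above, and error handling is reduced to a macro:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>   /* CUDA driver API */

#define CHECK(call) do { CUresult r = (call); \
    if (r != CUDA_SUCCESS) { fprintf(stderr, "driver API error %d\n", r); exit(1); } } while (0)

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    CUdeviceptr d_tmp;
    int host[32] = {0};

    CHECK(cuInit(0));
    CHECK(cuDeviceGet(&dev, 0));
    CHECK(cuCtxCreate(&ctx, 0, dev));

    /* sm.ptx would be the PTX the NVPTX backend emitted for the example above */
    CHECK(cuModuleLoad(&mod, "sm.ptx"));
    CHECK(cuModuleGetFunction(&fn, mod, "__pacxx_kernel0"));

    CHECK(cuMemAlloc(&d_tmp, sizeof host));
    CHECK(cuMemcpyHtoD(d_tmp, host, sizeof host));

    void *args[] = { &d_tmp };
    /* one block of 32 threads, no extra dynamic shared memory */
    CHECK(cuLaunchKernel(fn, 1, 1, 1, 32, 1, 1, 0, NULL, args, NULL));
    CHECK(cuCtxSynchronize());

    CHECK(cuMemcpyDtoH(host, d_tmp, sizeof host));
    CHECK(cuMemFree(d_tmp));
    CHECK(cuModuleUnload(mod));
    CHECK(cuCtxDestroy(ctx));
    return 0;
}
```

This is essentially what the fat-binary path does for you automatically: the runtime library embeds the PTX/SASS and performs the module load and kernel launch behind the <<<...>>> syntax.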

EDIT: Since the question arose where the __shared__ attribute is defined: the Clang headers include host_defines.h from the CUDA Toolkit. In host_defines.h (from CUDA 7.5) you can see:

  #define __shared__ \
          __location__(shared)

and __location__ (which is another macro definition) expands to __annotate__:

  #define __annotate__(a) \
          __attribute__((a))
  #define __location__(a) \
          __annotate__(a)

which is expanded to __attribute__, as described in the first part of the answer. So __shared__ ultimately expands to __attribute__((shared)).

Michael Haidl
  • While compiling CUDA code (.cu files) using Clang, we use CUDA libraries. If Clang does the complete conversion of the .cu file to IR, what is the purpose of the CUDA libraries in the compilation? – Ginu Jacob Jan 13 '16 at 06:22
  • Clang generates calls to CUDA libraries in the final executable to load and execute the kernels. – Michael Haidl Jan 13 '16 at 06:24
  • I can see a #define for __shared__ as __attribute__((shared)) in the CUDA libraries and also in the clang/test/ and clang/unittest/ folders. But for the conversion from .cu to IR, which of these is used? – Ginu Jacob Jan 13 '16 at 06:30
  • Clang brings an internal header called "cuda_runtime.h"; it includes all the necessary parts from CUDA. The other defines are for testing purposes only, as they are located in the unit-test directories of Clang. – Michael Haidl Jan 13 '16 at 06:37
  • In the Clang folder, I can see __clang_cuda_runtime_wrapper.h but not cuda_runtime.h. The file almost matches the contents of the cuda_runtime.h I found online at http://clang.llvm.org/doxygen/cuda__runtime_8h_source.html, but I hardly see any declarations for __shared__. – Ginu Jacob Jan 21 '16 at 02:07
  • One specific question is how __attribute__((shared)) gets mapped to address space 3. I can see the address spaces for the various CUDA memory types in the file /llvm/lib/Target/NVPTX/MCTargetDesc/NVPTXBaseInfo.h as enum AddressSpace { ADDRESS_SPACE_GENERIC = 0, ADDRESS_SPACE_GLOBAL = 1, ADDRESS_SPACE_SHARED = 3, ADDRESS_SPACE_CONST = 4, ADDRESS_SPACE_LOCAL = 5, // NVVM Internal ADDRESS_SPACE_PARAM = 101 }; and the corresponding switch cases in /llvm/lib/Target/NVPTX/NVPTXISelDAGToDAG.cpp (case ADDRESS_SPACE_SHARED:) – Ginu Jacob Feb 12 '16 at 03:36
  • (continued) But what I am interested in is at which stage of the conversion Clang identifies __attribute__((shared)) as address space 3. – Ginu Jacob Feb 12 '16 at 03:42
  • In other words, which files in the Clang code may give better insight into this? If we consider the example of the OpenCL keywords, they are defined in the file ./../../llvm/tools/clang/include/clang/Basic/Attr.td (e.g. KEYWORD(__global , KEYOPENCL)) and the corresponding cases can be observed in ./../../llvm/tools/clang/lib/Parse/ParseDecl.cpp (e.g. case tok::kw___global:) – Ginu Jacob Feb 12 '16 at 03:49
  • Please ask a new question; your comments here would likely be missed by someone else having this particular question. – Michael Haidl Feb 12 '16 at 06:37
  • Posted at http://stackoverflow.com/questions/35356840/conversion-from-attribute-shared-to-addrspace3-in-clang-compiler-when – Ginu Jacob Feb 12 '16 at 07:17
  • Is the above IR you provided a .bc file or a .ll file? I use the command sudo clang++ -emit-llvm -S -o outputIRFileName -I/home/ginu001/NVIDIA_CUDA-7.5_Samples/common/inc -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib64/stubs CUDACODE.cu -lcudart_static -lcuda -ldl -lrt -pthread to generate the IR, and this gives the .ll format. There I was unable to locate any addrspace variables. If your example is a .bc file, it would be great if you could share the compilation command. I tried to make the .bc file from CUDA but the efforts went in vain. – Ginu Jacob Feb 18 '16 at 16:26
  • You have to specify an nvptx target. – Michael Haidl Feb 18 '16 at 16:40