Kernel seems not to execute

Question

I'm a beginner at CUDA programming. This situation doesn't look complex, yet it doesn't work.

```cuda
#include <cuda.h>
#include <cuda_runtime.h>

#include <iostream>

__global__ void add(int *t)
{
    t[2] = t[0] + t[1];
}

int main(int argc, char **argv)
{
    int sum_cpu[3], *sum_gpu;

    sum_cpu[0] = 1;
    sum_cpu[1] = 2;
    sum_cpu[2] = 0;

    cudaMalloc((void**)&sum_gpu, 3 * sizeof(int));

    cudaMemcpy(sum_gpu, sum_cpu, 3 * sizeof(int), cudaMemcpyHostToDevice);

    add<<<1, 1>>>(sum_gpu);

    cudaMemcpy(sum_cpu, sum_gpu, 3 * sizeof(int), cudaMemcpyDeviceToHost);

    std::cout << sum_cpu[2];

    cudaFree(sum_gpu);

    return 0;
}
```

I'm compiling it like this:

```sh
nvcc main.cu
```

It compiles, but the printed value is 0. I tried printing from within the kernel and nothing was printed, so I assume it doesn't execute. Can you explain why?

Add [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code. You can also try running your code with `cuda-memcheck`. — Robert Crovella, Feb 07 '15 at 17:23
Thank you for the hints. Error checking reports 'unknown error' on the first cudaMalloc. `cuda-memcheck` detects 0 errors. — wiktus239, Feb 07 '15 at 17:34
You have a machine configuration problem. CUDA is not functional on that machine because it has not been installed correctly, or because of some other machine problem. You might want to carefully follow the instructions in [the getting started guide appropriate for your OS](http://docs.nvidia.com/cuda/index.html#getting-started-guides) including the verification steps. — Robert Crovella, Feb 07 '15 at 17:42

n2o · Accepted Answer · 2015-02-11T09:57:55.903

I checked your code and it is fine. It seems to me that you are compiling it incorrectly (assuming you installed the CUDA SDK properly); you may be missing some flags. This is a bit confusing in the beginning. Start by checking which compute capability your GPU has.

As a best practice, I use a Makefile for each of my CUDA projects. It is very easy to use once you have set up your paths correctly. A simplified version looks like this:

```makefile
NAME=base
# Compilers
NVCC = nvcc
CC = gcc
LINK = nvcc
CUDA_INCLUDE=/opt/cuda
CUDA_LIBS= -lcuda -lcudart
SDK_INCLUDE=/opt/cuda/include
# Flags
COMMONFLAGS =-O2 -m64
NVCCFLAGS =-gencode arch=compute_20,code=sm_20 -m64 -O2
CXXFLAGS =
CFLAGS =
INCLUDES = -I$(CUDA_INCLUDE)
LIBS = $(CUDA_LIBS)
ALL_CCFLAGS :=
ALL_CCFLAGS += $(NVCCFLAGS)
ALL_CCFLAGS += $(addprefix -Xcompiler ,$(COMMONFLAGS))
OBJS = cuda_base.o
# Build rules (recipe lines must start with a tab character, not spaces)
.DEFAULT: all

all: $(OBJS)
	$(LINK) -o $(NAME) $(LIBS) $(OBJS)
%.o: %.cu
	$(NVCC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.c
	$(NVCC) -ccbin $(CC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.cpp
	$(NVCC) -ccbin $(CXX) -c $(ALL_CCFLAGS) $(INCLUDES) $<
clean:
	rm $(OBJS) $(NAME)
```

Explanation

- I am using Arch Linux x64.
- The code is stored in a file called cuda_base.cu.
- The path to my CUDA SDK is /opt/cuda (yours may be different).
- Most important: what compute capability does your card have? Mine is a GTX 580, which has compute capability 2.0, so I have to pass the NVCC flag `-gencode arch=compute_20,code=sm_20`, which targets compute capability 2.0.
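The compute capability can be read from the CUDA runtime instead of being looked up by card name. Below is a minimal sketch, assuming the standard runtime calls `cudaGetDeviceCount` and `cudaGetDeviceProperties`; the helper names `gencodeFor` and `printGencodeFlags` are hypothetical. It also assumes your toolkit still supports the architecture it reports, since newer CUDA releases no longer target compute capability 1.x or 2.0.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: builds the nvcc -gencode value for a compute
// capability, e.g. (2, 0) -> "arch=compute_20,code=sm_20".
std::string gencodeFor(int major, int minor)
{
    const std::string cc = std::to_string(major) + std::to_string(minor);
    return "arch=compute_" + cc + ",code=sm_" + cc;
}

// Hypothetical helper: prints every visible GPU's name, compute capability
// and matching -gencode value. Returns the number of devices, or -1 if the
// runtime reports an error (for example a missing or mismatched driver).
int printGencodeFlags()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties(%d) failed: %s\n",
                         dev, cudaGetErrorString(err));
            return -1;
        }
        std::printf("device %d: %s, compute capability %d.%d -> -gencode %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    gencodeFor(prop.major, prop.minor).c_str());
    }
    return count;
}
```

Because this code contains no kernels, it does not depend on the `-gencode` setting, so it still works when the kernel build targets the wrong architecture. Call `printGencodeFlags()` from a `main` function and build it with plain `nvcc`; a return value of -1 usually points at a driver or installation problem rather than a flag problem.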

The Makefile needs to be stored next to cuda_base.cu. I copied and pasted your code into this file, then typed the following in the shell:

```sh
$ make
nvcc -c -gencode arch=compute_20,code=sm_20 -m64 -O2 -Xcompiler -O2 -Xcompiler -m64 -I/opt/cuda cuda_base.cu
nvcc -o base -lcuda -lcudart cuda_base.o
$ ./base
3
```

and got the expected result.

A friend of mine and I created a base template for writing CUDA code. You can find it here if you like.

Hope this helps ;-)

It would be very nice to hear why people are downvoting my answer. I really focused on this question: I checked the code, executed it on my machine, and found that it is essential to set the correct flags for nvcc in some cases. As I am using Linux, it is convenient to compile with a Makefile. This is a clean way to write CUDA code and compile it in the terminal. — n2o, Feb 16 '15 at 10:09

score 0 · Answer 2 · answered Feb 10 '15 at 22:31

I've had exactly the same problem. I tried the vector sum example from 'CUDA by Example' by Sanders & Kandrot: I typed in the code, added the vectors together, and out came zeros.

CUDA doesn't print error messages to the console; it only returns error codes from functions like cudaMalloc and cudaMemcpy. In my eagerness to get a working example, I didn't check those error codes, which was a basic mistake. So when I ran the sample that loads when you start a new CUDA project in Visual Studio, which does check for errors: bingo, an error. The error message was 'invalid device function'.
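A minimal sketch of that kind of checking, applied to the question's program. The macro `CUDA_TRY` and the functions `addSteps` and `addOnGpu` are hypothetical names; everything else is the standard CUDA runtime API. A kernel launch does not return an error code, so launch failures such as 'invalid device function' have to be fetched with `cudaGetLastError()`, and failures during execution with `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the question: t[2] = t[0] + t[1].
__global__ void add(int *t)
{
    t[2] = t[0] + t[1];
}

// Hypothetical macro: on failure, print file, line, the failing call and
// CUDA's error string, then return the error from the enclosing function.
#define CUDA_TRY(call)                                              \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            std::fprintf(stderr, "%s:%d: %s failed: %s\n",          \
                         __FILE__, __LINE__, #call,                 \
                         cudaGetErrorString(err_));                 \
            return err_;                                            \
        }                                                           \
    } while (0)

// Copies {a, b, 0} to the device, runs the kernel, and reads back t[2].
static cudaError_t addSteps(int *d_buf, int a, int b, int *result)
{
    int h_buf[3] = {a, b, 0};
    CUDA_TRY(cudaMemcpy(d_buf, h_buf, sizeof h_buf, cudaMemcpyHostToDevice));

    add<<<1, 1>>>(d_buf);
    // Launch errors (e.g. no code compiled for this GPU) are reported here...
    CUDA_TRY(cudaGetLastError());
    // ...and errors raised while the kernel runs are reported here.
    CUDA_TRY(cudaDeviceSynchronize());

    CUDA_TRY(cudaMemcpy(h_buf, d_buf, sizeof h_buf, cudaMemcpyDeviceToHost));
    *result = h_buf[2];
    return cudaSuccess;
}

// Allocates the device buffer, runs the checked steps, and always frees it.
cudaError_t addOnGpu(int a, int b, int *result)
{
    int *d_buf = nullptr;
    // As the first runtime call, cudaMalloc also initializes the CUDA
    // context, so a broken driver or installation often fails right here.
    CUDA_TRY(cudaMalloc(&d_buf, 3 * sizeof(int)));
    cudaError_t err = addSteps(d_buf, a, b, result);
    cudaFree(d_buf);
    return err;
}
```

Calling `addOnGpu(1, 2, &r)` and printing `r` only when it returns `cudaSuccess` reproduces the question's program, but a misconfigured build or installation now produces a message naming the failing call instead of a silent 0. The checked steps live in a separate function so the buffer is freed no matter which step fails.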

Checking the compute capability of my card, using the program in the book or an equivalent, showed that it was...

... wait for it...

1.1

So, I changed the compile options. In Visual Studio 13, Project -> Properties -> Configuration Properties -> CUDA C/C++ -> Device -> Code Generation.

I changed the setting from compute_20,sm_20 to compute_11,sm_11, which makes the compiler target compute capability 1.1 instead of the assumed 2.0.

Now, the rebuilt code works as expected.

I hope that is useful.
