Questions tagged [dynamic-parallelism]

Dynamic parallelism refers to the CUDA capability to launch a device kernel from within a device kernel.

This tag should be used for questions pertaining to CUDA dynamic parallelism, i.e., the capability of CUDA devices of compute capability 3.5 or higher to launch a device kernel from within a device kernel. Using this functionality also requires certain CUDA compilation switches, such as the switch to enable relocatable device code (-rdc=true) and the switch to link in the device runtime library (-lcudadevrt).
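As a minimal sketch (hypothetical file name; a compute-capability 3.5 target assumed), a dynamic-parallelism program and the two switches mentioned above might look like:

    // dp_example.cu — minimal dynamic-parallelism example.
    // Build: nvcc -arch=sm_35 -rdc=true dp_example.cu -lcudadevrt -o dp_example
    #include <cstdio>

    __global__ void child(int parentThread)
    {
        printf("child launched by parent thread %d\n", parentThread);
    }

    __global__ void parent()
    {
        // Each parent thread launches a one-thread child grid.
        child<<<1, 1>>>(threadIdx.x);
    }

    int main()
    {
        parent<<<1, 4>>>();
        cudaDeviceSynchronize();   // host-side wait for parent and children
        return 0;
    }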

50 questions
0 votes, 1 answer

Synchronizing depth of nested kernels

Let's take the following code, where there is a parent and a child kernel. From the parent kernel we wish to start threadIdx.x child kernels in different streams to maximize parallel throughput. We then wait for those children with…
user2255757 • 756
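For reference, a hedged sketch of the pattern this question describes: per-thread device-side streams, assuming a pre-CUDA-12 toolkit where device-side cudaDeviceSynchronize() is still available (it was removed in CUDA 12).

    __global__ void child(int id) { /* ... child work ... */ }

    __global__ void parent()
    {
        cudaStream_t s;
        // Device-created streams must use the non-blocking flag.
        cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
        child<<<1, 32, 0, s>>>(threadIdx.x);   // one child per parent thread
        cudaStreamDestroy(s);
        // Waits for all children launched by this block so far.
        cudaDeviceSynchronize();
    }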
0 votes, 1 answer

Compile multiple CUDA files (that use dynamic parallelism) and MPI code

I have a bunch of .cu files that use dynamic parallelism (a.cu, b.cu, c.cu.., e.cu, f.cu), and a main.c file that uses MPI to call functions from a.cu on multiple nodes. I'm trying to write a makefile to compile the executable, but I keep facing…
user2330963 • 7
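A sketch of the usual separate-compilation flow for this setup; the wrapper name and the exact link line are assumptions, not taken from the question:

    // a.cu — each .cu is compiled with -dc, then device-linked once:
    //   nvcc -arch=sm_35 -dc a.cu b.cu c.cu e.cu f.cu
    //   nvcc -arch=sm_35 -dlink a.o b.o c.o e.o f.o -o dlink.o
    //   mpicc main.c a.o b.o c.o e.o f.o dlink.o \
    //         -L/usr/local/cuda/lib64 -lcudadevrt -lcudart -o app
    __global__ void kernel_a() { /* ... dynamic-parallelism code ... */ }

    // extern "C" wrapper so the MPI main.c can call into a.cu.
    extern "C" void launch_a()
    {
        kernel_a<<<1, 1>>>();
        cudaDeviceSynchronize();
    }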
0 votes, 1 answer

Synchronization in CUDA dynamic parallelism

I am testing dynamic parallelism with the following kernel, which finds the maximum value of an integer array in a divide-and-conquer fashion: __global__ void getMax(int * arr, int ini, int fin, int * maxv) { if…
Matias Haeussler • 1,061
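A hedged sketch of that divide-and-conquer shape, using the same signature as the question. It assumes device-side malloc and the pre-CUDA-12 device-side cudaDeviceSynchronize(); note that children cannot read the parent's local variables, so partial results must live in global memory, and that recursion depth is capped by cudaLimitDevRuntimeSyncDepth.

    __global__ void getMax(int *arr, int ini, int fin, int *maxv)
    {
        if (fin - ini < 2) {              // base case: one element
            *maxv = arr[ini];
            return;
        }
        int mid = ini + (fin - ini) / 2;
        // Partial results must be in global memory, not parent locals.
        int *part = (int *)malloc(2 * sizeof(int));
        getMax<<<1, 1>>>(arr, ini, mid, &part[0]);
        getMax<<<1, 1>>>(arr, mid, fin, &part[1]);
        cudaDeviceSynchronize();          // wait for both children
        *maxv = max(part[0], part[1]);
        free(part);
    }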
0 votes, 1 answer

How can I synchronize device-side command queues with host-side queues? clFinish() and markerWithWaitList give an invalid queue error

I'm using OpenCL 2.0's dynamic parallelism feature and have each work-item enqueue another kernel with a single work-item. When the child kernel's completion time is high, the parent kernel completes before its children and memory consistency is not preserved…
huseyin tugrul buyukisik • 11,469
0 votes, 1 answer

Using shared memory in Dynamic Parallelism CUDA

Question 1: Do I have to specify the amount of dynamic shared memory to be allocated at the launch of the parent kernel if shared memory is only used by the child kernel? Question 2: The following are my child kernel and parent kernel: Parent…
Aliya Clark • 131
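To Question 1 the usual answer is no: dynamic shared memory is a per-launch property, so it is sized in the child's own launch configuration. A minimal sketch (sizes are illustrative):

    __global__ void child()
    {
        extern __shared__ float buf[];   // sized by the child's launch below
        buf[threadIdx.x] = (float)threadIdx.x;
    }

    __global__ void parent()
    {
        if (threadIdx.x == 0)
            child<<<1, 128, 128 * sizeof(float)>>>();  // shared bytes set here
    }

    // Host side: parent<<<1, 32>>>(); — no dynamic shared memory needed.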
0 votes, 1 answer

Dynamic parallelism - passing contents of shared memory to spawned blocks?

I've been writing CUDA kernels for a while now, but I've not used dynamic parallelism (DP) yet. I've come up against a task for which I think it might fit; however, the way I would like to be able to use DP is: if a block figures out it needs more…
einpoklum • 118,144
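A sketch of the standard workaround, since a child grid cannot see its parent's __shared__ memory: stage the data through a preallocated global buffer (the scratch parameter here is an assumption):

    __global__ void child(const float *staged, int n) { /* ... */ }

    __global__ void parent(float *scratch)   // global, one slot per thread
    {
        extern __shared__ float smem[];
        smem[threadIdx.x] = (float)threadIdx.x;  // ... fill shared memory ...
        __syncthreads();

        // Copy the block's shared contents out to global memory.
        scratch[blockIdx.x * blockDim.x + threadIdx.x] = smem[threadIdx.x];
        __syncthreads();

        if (threadIdx.x == 0)
            child<<<1, blockDim.x>>>(&scratch[blockIdx.x * blockDim.x],
                                     blockDim.x);
    }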
0 votes, 1 answer

"device-function-maxrregcount" message while compiling cuda code

I am trying to write code that performs multiple vector dot products inside the kernel. I'm using the cublasSdot function from the cuBLAS library to perform the dot products. This is my code: using namespace std; __global__ void ker(float * a, float *…
starrr • 1,013
0 votes, 1 answer

Trouble compiling/running CUDA code involving dynamic parallelism

I am trying to use dynamic parallelism with CUDA, but I cannot get through the compilation step. I am working on a GPU with compute capability 3.5 and CUDA 7.5. Depending on the switches in the compile command I use, I am getting…
VincentN • 63
0 votes, 1 answer

Accessing CUDA built-in variable in child kernel

I'm trying to use Kepler's Dynamic Parallelism for one of my applications. The global index of the thread (in the parent kernel) launching the child kernel is needed in the child kernel. In other words, I want to access the parent's built-in…
user3813674 • 2,553
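A minimal sketch of the common answer: built-in variables are specific to each grid, so the parent's global index is passed down as an ordinary kernel argument:

    __global__ void child(int parentGlobalIdx)
    {
        // parentGlobalIdx identifies the parent thread that launched us.
    }

    __global__ void parent()
    {
        int gidx = blockIdx.x * blockDim.x + threadIdx.x;
        child<<<1, 32>>>(gidx);
    }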
0 votes, 1 answer

Cublas not working within kernel once compiled to cubin using -G flag with nvcc

I have a CUDA kernel that looks like the following: #include #include #include extern "C" { __device__ float ONE = 1.0f; __device__ float M_ONE = -1.0f; __device__ float ZERO = 0.0f; …
Bam4d • 610
0 votes, 1 answer

Do kernel-launched child kernels have the same warp size as host-launched kernels?

When a kernel block is launched from the host, it has a warp size of 32. Is it the same for child kernels launched via dynamic parallelism? My guess would be yes, but I haven't seen it in the docs. The larger question, of course, is: is it worth it?…
mmdanziger • 4,466
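A small check is easy to write, since warpSize is a property of the device rather than of how the grid was launched (it is 32 on all current NVIDIA GPUs); a sketch:

    #include <cstdio>

    __global__ void child()
    {
        if (threadIdx.x == 0) printf("child warpSize  = %d\n", warpSize);
    }

    __global__ void parent()
    {
        if (threadIdx.x == 0) printf("parent warpSize = %d\n", warpSize);
        child<<<1, 32>>>();
    }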
0 votes, 1 answer

CUDA dynamic parallelism linkage error with extern "C"

I'm trying to link my CUDA Kepler's Dynamic Parallelism program as follows: nvcc -m32 -arch=sm_35 -dc -Xcompiler '-fPIC' DFS_Solving.cu nvcc -m32 -arch=sm_35 -Xcompiler '-fPIC' -dlink DFS_Solving.o -o link.o gcc -shared -Wl,-soname,libdfs.so -o…
0 votes, 1 answer

Dynamic Parallelism in CUDA not working

I wrote a simple code to understand dynamic parallelism. From the values being printed, I see that the child kernel has executed correctly, but when I come back to the parent kernel, I see wrong values being used in place of the temp array, which is being…
Jagannath • 47
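The symptom described usually means the parent reads a child's output without waiting for the child to finish. A sketch of the fix on pre-CUDA-12 toolkits (where device-side cudaDeviceSynchronize() exists), with temp assumed to be a global-memory buffer:

    #include <cstdio>

    __global__ void child(int *temp) { temp[threadIdx.x] = threadIdx.x; }

    __global__ void parent(int *temp)
    {
        child<<<1, 32>>>(temp);
        cudaDeviceSynchronize();     // without this, temp may still be stale
        if (threadIdx.x == 0)
            printf("temp[5] = %d\n", temp[5]);
    }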
0 votes, 1 answer

CUDA dynamic parallelism: invalid global write when using texture memory

I seem to have trouble when a kernel called within a kernel (even a recursive call) uses texture memory to get a value. If the child kernel, say a different one, doesn't use texture memory, everything is fine. If I don't call a kernel within a kernel,…
salvaS • 13
0 votes, 1 answer

Nvidia Jetson TK1 Development Board - Cuda Compute Capability

I am quite impressed with this development kit. Instead of buying a new CUDA card, which might require a new main board etc., this board seems to provide it all in one. Its specs say it has CUDA compute capability 3.2. AFAIK dynamic parallelism…
phoad • 1,801