Questions tagged [dynamic-parallelism]

Dynamic parallelism refers to the CUDA capability to launch a device kernel from within another device kernel.

This tag should be used for questions pertaining to CUDA dynamic parallelism: the capability of CUDA devices of compute capability 3.5 or higher to launch a device kernel from within a device kernel. Using this functionality also requires certain CUDA compilation switches, such as the switch that enables relocatable device code (-rdc=true) and the switch that links in the device runtime library (-lcudadevrt).
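As a hedged illustration (file and kernel names are hypothetical), the smallest possible dynamic-parallelism program and its compile line look roughly like this:

    #include <cstdio>

    __global__ void child()
    {
        // Runs on the device, but is launched from the parent kernel below
        printf("child thread %d\n", threadIdx.x);
    }

    __global__ void parent()
    {
        // Device-side kernel launch: the defining feature of dynamic parallelism
        child<<<1, 4>>>();
    }

    int main()
    {
        parent<<<1, 1>>>();
        cudaDeviceSynchronize();  // wait for the parent grid (and its children)
        return 0;
    }

    // Compile sketch: nvcc -arch=sm_35 -rdc=true dp.cu -o dp -lcudadevrt

Here -rdc=true enables relocatable device code and -lcudadevrt links the device runtime library that services device-side launches.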

50 questions
1
vote
3 answers

Generating Relocatable Device Code using NVIDIA Nsight

I'm trying to compile a dynamic parallelism example on CUDA, and when I try to compile it gives an error saying: kernel launch from __device__ or __global__ functions requires separate compilation mode. Later I found that I have to set the…
BAdhi
  • 420
  • 7
  • 19
1
vote
2 answers

Nested Directives in OpenACC

I'm trying to use the nesting feature of OpenACC to activate dynamic parallelism on my GPU card. I have a Tesla 40c and my OpenACC compiler is PGI version 15.7. My code is very simple. When I try to compile the following code, the compiler returns these messages…
grypp
  • 405
  • 2
  • 15
1
vote
2 answers

Understanding Dynamic Parallelism in CUDA

Example of dynamic parallelism: __global__ void nestedHelloWorld(int const iSize,int iDepth) { int tid = threadIdx.x; printf("Recursion=%d: Hello World from thread %d" "block %d\n",iDepth,tid,blockIdx.x); // condition to stop recursive…
John
  • 3,037
  • 8
  • 36
  • 68
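The code in the excerpt above is cut off; a minimal reconstruction of the recursive pattern it shows (the stop condition and relaunch logic are assumptions based on the visible fragment):

    #include <cstdio>

    __global__ void nestedHelloWorld(int const iSize, int iDepth)
    {
        int tid = threadIdx.x;
        printf("Recursion=%d: Hello World from thread %d block %d\n",
               iDepth, tid, blockIdx.x);

        // condition to stop recursive execution
        if (iSize == 1) return;

        // halve the thread count; thread 0 launches the child grid
        int nthreads = iSize / 2;
        if (tid == 0 && nthreads > 0) {
            nestedHelloWorld<<<1, nthreads>>>(nthreads, ++iDepth);
            printf("-------> nested execution depth: %d\n", iDepth);
        }
    }

Launched from the host as, say, nestedHelloWorld<<<1, 8>>>(8, 0), each level prints its greeting and relaunches itself with half as many threads until a single thread remains.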
1
vote
1 answer

Is it possible to call cublas functions from a device function?

Here, Robert Crovella said that cuBLAS routines can be called from device code. Although I am using dynamic parallelism and compiling for compute capability 3.5, I cannot manage to call cuBLAS routines from a device function. I always get the…
emartel
  • 49
  • 9
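For context: a device-side cuBLAS API did exist, but only in older toolkits (roughly CUDA 5.5 through 9.x; it was removed in CUDA 10), and it required compute capability 3.5 plus linking the device library. A legacy-toolkit sketch, not valid on current CUDA:

    #include <cublas_v2.h>

    __global__ void deviceGemm(int n, const float *A, const float *B, float *C)
    {
        // Legacy device-side cuBLAS: the handle is created inside the kernel
        cublasHandle_t handle;
        cublasCreate(&handle);

        const float alpha = 1.0f, beta = 0.0f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, A, n, B, n, &beta, C, n);

        cublasDestroy(handle);
    }

    // Legacy link sketch: nvcc -arch=sm_35 -rdc=true gemm.cu -lcublas_device -lcudadevrt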
1
vote
1 answer

CUDA recursion depth

When using Dynamic Parallelism in CUDA, you can implement recursive algorithms like mergeSort. I have implemented it, but my program doesn't work for inputs greater than blah. My question is how many levels of depth in the recursion tree the implementation can…
AmirSojoodi
  • 1,080
  • 2
  • 12
  • 31
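The reachable recursion depth is bounded by device-runtime limits (the hardware caps nesting at 24 levels), which must be raised from the host before the first launch. A sketch with assumed values:

    // Depth at which a device-side cudaDeviceSynchronize() is still legal
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 16);

    // How many child launches can be buffered before launches start failing
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 4096);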
1
vote
1 answer

numba.typeinfer.TypingError: Untyped global name 'child_launch' when using CUDA Dynamic Parallelism in Python (Anaconda) on an NVIDIA GPU

My code is here: import numpy as np from numbapro import cuda @cuda.autojit def child_launch(data): data[cuda.threadIdx.x] = data[cuda.threadIdx.x] + 100 @cuda.autojit def parent_launch(data): data[cuda.threadIdx.x] = cuda.threadIdx.x …
1
vote
1 answer

How to compile a .cu with dynamic parallelism?

I have 2 .cpp files, setup and functions, and 6 .cu files: main, flood, timestep, discharge, continuity and copy. I'm trying to compile this so that main calls the .cpp files and the global flood kernel, and flood then calls timestep, discharge, continuity…
Seffrin
  • 23
  • 5
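A hedged sketch of the separate-compilation flow for a project shaped like this one (file names taken from the question, architecture assumed):

    # Compile every CUDA unit with relocatable device code
    nvcc -arch=sm_35 -rdc=true -c main.cu flood.cu timestep.cu discharge.cu continuity.cu copy.cu

    # Compile the C++ units normally
    nvcc -c setup.cpp functions.cpp

    # Let nvcc device-link and host-link everything
    nvcc -arch=sm_35 -rdc=true main.o flood.o timestep.o discharge.o continuity.o copy.o \
         setup.o functions.o -o app -lcudadevrt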
0
votes
1 answer

CUDA dynamic parallelism is computing sequentially

I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of the initial matrix E and multiplies them by some numbers calculated from v and t for each t in a given range. I am a newbie to…
0
votes
1 answer

How do I wait for child kernels to finish in a parent kernel before executing the rest of the parent kernel in CUDA dynamic parallelism?

So I need the runParatron children to fully finish before the next iteration of the for loop happens. Based on the results I am getting, I'm pretty sure that's not happening. For example, I have a print statement in runParatron that executes AFTER…
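The classic answer was a device-side cudaDeviceSynchronize() after the child launch, which blocks the calling parent thread until its child grids have finished (this device-side form was deprecated in CUDA 11.6 and removed in CUDA 12.0). A sketch with hypothetical signatures and launch dimensions:

    __global__ void runParatron(float *data);   // hypothetical child kernel

    __global__ void parent(float *data, int nIters)
    {
        for (int i = 0; i < nIters; ++i) {
            runParatron<<<4, 128>>>(data);
            // Block this parent thread until the child grid completes
            // (legacy device-side sync; removed in CUDA 12.0)
            cudaDeviceSynchronize();
        }
    }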
0
votes
1 answer

Can I copy files from SharePoint to Azure Blob Storage using a dynamic file path?

I am building a pipeline at work to copy files from SharePoint to Azure Blob Storage. After reading some documentation, I was able to create a pipeline that only copies certain files. However, I would like to automate this pipeline by using dynamic…
0
votes
1 answer

CUDA dynamic parallelism: Access child kernel results in global memory

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this: int aPayloads[32]; // Compute aPayloads start values here int* aGlobalPayloads =…
Silicomancer
  • 8,604
  • 10
  • 63
  • 130
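The pattern the question is reaching for, sketched with the excerpt's names (everything past the visible fragment is assumed): local arrays are not visible to child grids, so results must round-trip through global memory.

    __global__ void child(int *aGlobalPayloads)
    {
        // The child reads and writes global memory that both grids can see
        aGlobalPayloads[threadIdx.x] *= 2;
    }

    __global__ void parent()
    {
        int aPayloads[32];                 // local memory: NOT visible to child grids
        // ... compute aPayloads start values here ...

        // Device-side heap allocation gives the child a buffer in global memory
        int *aGlobalPayloads = (int *)malloc(32 * sizeof(int));
        memcpy(aGlobalPayloads, aPayloads, 32 * sizeof(int));

        child<<<1, 32>>>(aGlobalPayloads);
        cudaDeviceSynchronize();           // legacy device-side sync (see above)

        // aGlobalPayloads now holds the child's results
        free(aGlobalPayloads);
    }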
0
votes
1 answer

Can a CUDA parent kernel launch a child kernel with more threads than the parent?

I'm trying to learn how to use CUDA Dynamic Parallelism. I have a simple CUDA kernel that creates some work, then launches new kernels to perform that work. Let's say I launch the parent kernel with only 1 block of 1 thread, like so: int nItems =…
Justin
  • 1,881
  • 4
  • 20
  • 40
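Yes: a child grid's dimensions are independent of the parent's. A sketch along the lines of the question's setup (names hypothetical):

    __global__ void childKernel(int *items, int nItems)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nItems) items[i] = i;      // the child runs far more threads than the parent
    }

    __global__ void parentKernel(int *items, int nItems)
    {
        // The parent runs as a single thread yet launches an arbitrarily shaped child grid
        int threads = 256;
        int blocks  = (nItems + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(items, nItems);
    }

    // Host side: parentKernel<<<1, 1>>>(d_items, nItems);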
0
votes
1 answer

Why is cudaLaunchCooperativeKernel() returning "not permitted"?

So I am using a GTX 1050 with compute capability 6.1 and CUDA 11.0. I need grid synchronization in my program, so cudaLaunchCooperativeKernel() is needed. I have checked my device query, and the GPU does have support for cooperative groups…
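A quick host-side check for this situation (device 0 assumed): cooperative launch has its own device attribute, and on Windows it typically reads 0 under the WDDM driver model even when the GPU hardware is capable.

    int supported = 0;
    cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, 0);
    if (!supported) {
        // cudaLaunchCooperativeKernel() will fail on this device/driver combination
    }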
0
votes
1 answer

How to call a Thrust function in a stream from a kernel?

I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be called from a device kernel. The following code was…
heapoverflow
  • 264
  • 2
  • 12
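For reference, Thrust algorithms can be invoked from device code with a device execution policy when the file is compiled with -rdc=true; whether that dispatches a child grid depends on the Thrust version. A sketch:

    #include <thrust/execution_policy.h>
    #include <thrust/scatter.h>

    __global__ void scatterKernel(const int *values, const int *map,
                                  int *output, int n)
    {
        // On Thrust 1.8-era toolkits this could launch a child grid via dynamic
        // parallelism; modern Thrust instead runs it sequentially in this thread.
        thrust::scatter(thrust::device, values, values + n, map, output);
    }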
0
votes
1 answer

NVIDIA Visual Profiler not showing cudaMalloc() after kernel launch

I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel launched from the host. I use dynamic parallelism to launch successive kernels from the…
progammer
  • 1,951
  • 11
  • 28
  • 50