
I am very new to the CUDA programming model and to programming in general, I suppose. I'm attempting to parallelize an expectation-maximization algorithm. I am working on a GTX 480, which has compute capability 2.0. At first I sort of assumed that there's no reason for the device to launch its own threads, but of course, I was sadly mistaken. I came across this PDF:

http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf

Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too many specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive ability. I'm not sure if my question makes sense, so please ask if you need clarification.

Thank you!

Alex Riley
user2529048
  • In dynamic parallelism, you have kernel calls inside kernels. If you stipulate that a kernel call can be exchanged with a `for` loop, then, very roughly speaking, you should replace the nested kernel calls with `for` loops. – Vitality Jul 10 '13 at 18:52
  • @RobertCrovella You guys should really do something about the naming of the GPUs. A GT 640 comes as 2.1, 3.0 and 3.5 ? https://developer.nvidia.com/cuda-gpus – Pavan Yalamanchili Jul 11 '13 at 02:45
  • @RobertCrovella The GT 640 no longer shows as compute 3.5. Maybe you want to retract that comment. – Pavan Yalamanchili Jul 12 '13 at 19:35
  • There are plenty of references on the web to a GT640 with GK208, as well as other products besides the one I linked, such as [this one](http://www.ask-corp.jp/products/leadtek/nvidia-graphicsboard/geforce-gt640/winfast-gt640-gk208-lp-gd5-1g.html). I'm investigating further, but I'm quite confident that: 1. There is a "relatively new" version of [GT640 that uses GK208](http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units), and 2. GK208 is [a compute 3.5 device](http://technewspedia.com/gk208-the-new-low-end-gpu-nvidia/). In the meantime, I've deleted the comment. – Robert Crovella Jul 12 '13 at 21:07
  • I did purchase this [ASUS GT640 product](http://www.newegg.com/Product/Product.aspx?Item=N82E16814121771) and just received and installed it today, and I confirm it shows up in deviceQuery as a compute 3.5 capable device. Less than $100. – Robert Crovella Jul 22 '13 at 23:16

2 Answers


As indicated by @JackOLantern, dynamic parallelism can be described in a nutshell as the ability to call a kernel (i.e. a `__global__` function) from device code (a `__global__` or `__device__` function).
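For illustration, here is a minimal sketch (the kernel names are made up) of what such a device-side launch looks like; it only compiles and runs on a compute capability 3.5 device with relocatable device code enabled:

```cuda
// Sketch only (hypothetical kernels): a device-side launch, which is what
// dynamic parallelism adds. Compile with something like:
//   nvcc -arch=sm_35 -rdc=true -lcudadevrt dp_sketch.cu
#include <cstdio>

__global__ void childKernel(int parent)
{
    printf("child thread %d launched by parent thread %d\n", threadIdx.x, parent);
}

__global__ void parentKernel()
{
    // On cc 3.5+ devices, device code may launch kernels itself.
    childKernel<<<1, 4>>>(threadIdx.x);
}

int main()
{
    parentKernel<<<1, 2>>>();
    cudaDeviceSynchronize();   // wait for the parent grid (and its children) to finish
    return 0;
}
```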

Since the kernel call is the principal method by which the machine spins up multiple threads in response to a single function call, there is really no direct alternative that provides all the capability of dynamic parallelism on a device that does not support it (i.e., pre-cc-3.5 devices).

Without dynamic parallelism, your overall code will almost certainly involve more synchronization and communication between CPU code and GPU code.

The principal method would be to identify some unit of your code as parallelizable, convert it to a kernel, and work through your code in an essentially non-nested fashion. Repetitive functions might be handled by looping within the kernel, or else by looping in the host code that calls the kernel.
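A rough sketch of that pattern, with made-up kernel and variable names (and assuming each outer iteration's inner work can be flattened into a single launch): the outer loop stays on the host, and the inner, parallelizable work becomes one flat kernel launch per iteration.

```cuda
// Sketch only: hypothetical names and a placeholder computation.
__global__ void innerStepKernel(const float *in, float *out, int n, int outerIdx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] + outerIdx;   // placeholder for the real inner-loop body
    }
}

void runOuterLoopOnHost(const float *d_in, float *d_out, int n, int numOuter)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    for (int outer = 0; outer < numOuter; ++outer) {      // host-side outer loop
        innerStepKernel<<<blocks, threads>>>(d_in, d_out, n, outer);
        // Launches on the same stream are serialized, so an explicit
        // cudaDeviceSynchronize() inside the loop is only needed if the host
        // must inspect results between iterations.
    }
    cudaDeviceSynchronize();
}
```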

For a pictorial example of what I am trying to describe, please refer to slide 14 of this deck, which introduces some of the new features of CUDA 5, including dynamic parallelism. The code architecture on the right is an algorithm realized with dynamic parallelism. The architecture on the left is the same algorithm realized without it.

Robert Crovella

I have checked your algorithm on Wikipedia, and I'm not sure you need dynamic parallelism at all.

You do the expectation step in your kernel, `__syncthreads()`, do the maximization step, and `__syncthreads()` again. From this distance, the expectation step looks like a reduction primitive, and the maximization step looks like a filter.
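Very roughly, I mean a skeleton like the one below (made-up names, and single block only, since `__syncthreads()` only synchronizes the threads of one block; a multi-block version would need separate kernel launches between the two steps):

```cuda
// Sketch only: single-block skeleton of the E-step / M-step alternation.
__global__ void emIterationKernel(float *data, float *params, int n, int numIters)
{
    for (int it = 0; it < numIters; ++it) {
        // Expectation step: each thread processes a strided slice of the data.
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            // compute responsibilities for data[i] from the current params ...
        }
        __syncthreads();   // make all E-step results visible before the M-step

        // Maximization step: e.g. a block-wide reduction to update params.
        for (int i = threadIdx.x; i < n; i += blockDim.x) {
            // accumulate sufficient statistics and update params ...
        }
        __syncthreads();   // params fully updated before the next iteration
    }
}
```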

If that doesn't work and you need real task parallelism, a GPU may not be the best choice. While the Kepler GPUs can do this to some degree, it is not what the architecture is designed for. In that case you might be better off using a multi-CPU system, such as an office grid, a supercomputer, or a Xeon Phi accelerator. You should also look at OpenMP and MPI, which are the usual tools for task-parallel programming (OpenMP is actually just a handful of pragmas in most cases).

Kristóf Szalay
  • Let me state beforehand that I'm not very familiar with expectation maximization. Nevertheless, it seems to me that, apart from particular cases, maximization should be dealt with as an optimization problem in which the expectation is calculated at each iteration step. Depending on the number of parameters you have to optimize, GPU massive parallelism could be useful to calculate the functional to be optimized, along with its derivatives, to achieve fast and accurate convergence. – Vitality Jul 11 '13 at 12:45
  • @Kristof, the current implementation I am working with uses Open MPI, but it's not quite as fast as it has to be for what I'm working on. – user2529048 Jul 11 '13 at 19:30