I am very new to the CUDA programming model and programming in general, I suppose. I'm attempting to parallelize an expectation maximization algorithm. I am working on a gtx 480 which has compute capability 2.0. At first, I sort of assumed that there's no reason for the device to launch its own threads, but of course, I was sadly mistaken. I came across this pdf.
http://docs.nvidia.com/cuda/pdf/CUDA_Dynamic_Parallelism_Programming_Guide.pdf
Unfortunately, dynamic parallelism only works on the latest and greatest GPUs, with compute capability 3.5. Without diving into too much specifics, what is the alternative to dynamic parallelism? The loops in the CPU EM algorithm have many dependencies and are highly nested, which seems to make dynamic parallelism an attractive ability. I'm not sure if my question makes sense so please ask if you need clarification.
Thank you!