CUDA: Bigger problems in threads

Question

Almost all of the CUDA example code describes doing near-atomic operations on large data sets. What practical limitations are there on the size of the problem each thread can handle?

For example, I have another question open at the minute that involves per-thread matrix solving. Is this kind of thing too large to put within each thread?

score 3 · Accepted Answer · answered Apr 14 '11 at 15:16

CUDA is a data parallel programming model for what is effectively an SIMD architecture, so it isn't as flexible as a general-purpose multithreaded or MIMD architecture. Even so, kernels can certainly be a lot more complex than simple arithmetic operations.

In my own work I use CUDA a lot for solving partial differential equations (so the finite element, finite difference and finite volume methods), where every thread processes a cell or element from a discretised continuum. In that sort of calculation, there are a lot of FLOPs per thread per cell/element.
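As a rough sketch of a thread doing much more than one arithmetic operation, the kernel below has each thread solve its own small dense linear system, in the spirit of the per-thread matrix solve mentioned in the question. The system size `N = 4`, the names `solveSmallSystems` and `solveOnGpu`, and the lack of pivoting (which assumes well-conditioned, diagonally dominant matrices) are all assumptions made for illustration, not anything taken from the poster's code. Error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int N = 4;  // assumed per-thread system size (illustrative only)

// Each thread solves one N x N system A x = b by Gaussian elimination.
// No pivoting: this sketch assumes diagonally dominant matrices.
__global__ void solveSmallSystems(const float* A, const float* b,
                                  float* x, int numSystems)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= numSystems) return;

    // Private working copies of this thread's matrix and right-hand side.
    float a[N][N], rhs[N];
    for (int i = 0; i < N; ++i) {
        rhs[i] = b[tid * N + i];
        for (int j = 0; j < N; ++j)
            a[i][j] = A[tid * N * N + i * N + j];
    }

    // Forward elimination to upper-triangular form.
    for (int k = 0; k < N; ++k)
        for (int i = k + 1; i < N; ++i) {
            float f = a[i][k] / a[k][k];
            for (int j = k; j < N; ++j) a[i][j] -= f * a[k][j];
            rhs[i] -= f * rhs[k];
        }

    // Back substitution; rhs is overwritten with the solution.
    for (int i = N - 1; i >= 0; --i) {
        float s = rhs[i];
        for (int j = i + 1; j < N; ++j) s -= a[i][j] * rhs[j];
        rhs[i] = s / a[i][i];
    }
    for (int i = 0; i < N; ++i) x[tid * N + i] = rhs[i];
}

// Hypothetical host helper: copies the systems to the GPU, solves, copies back.
std::vector<float> solveOnGpu(const std::vector<float>& A,
                              const std::vector<float>& b, int numSystems)
{
    float *dA = nullptr, *dB = nullptr, *dX = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, b.size() * sizeof(float));
    cudaMalloc(&dX, b.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), b.size() * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (numSystems + threads - 1) / threads;
    solveSmallSystems<<<blocks, threads>>>(dA, dB, dX, numSystems);

    std::vector<float> x(b.size());
    cudaMemcpy(x.data(), dX, x.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dX);
    return x;
}
```

All the loops have fixed trip counts, so every thread in a warp follows the same control flow even though each one works on different data.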

The key area to be mindful of is branch divergence. Because it is an SIMD architecture under the hood, code with a lot of branching within a warp of threads (which is effectively the SIMD width) will suffer performance penalties. But branch divergence and code complexity need not be synonymous: you can write very "branchy" and "loopy" code which will run well, as long as threads within any given warp don't diverge too often. In algorithms heavy on floating-point and integer operations, that is usually not too hard to achieve.
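To make the distinction concrete, here is a minimal sketch of two kernels containing the same branch. In the first, the condition differs between neighbouring threads, so each warp has to execute both paths. In the second, the condition is the same for all 32 threads of a warp, so each warp executes only one path. The names `pathA`, `pathB`, `divergentWithinWarp`, `uniformWithinWarp` and `runBranchDemo` are hypothetical, the two paths are trivial placeholders for real work, and the sketch assumes one-dimensional blocks whose size is a multiple of the warp size. For paths this small the compiler may well use predication instead of a real branch, so treat this as an illustration of the pattern rather than a benchmark.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Two deliberately different code paths (placeholders for real per-thread work).
__device__ float pathA(float v) { for (int i = 0; i < 64; ++i) v += 1.0f; return v; }
__device__ float pathB(float v) { for (int i = 0; i < 64; ++i) v -= 1.0f; return v; }

// Divergent: even and odd lanes of the same warp take different paths,
// so the warp runs pathA and pathB one after the other.
__global__ void divergentWithinWarp(float* data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    if (threadIdx.x % 2 == 0) data[tid] = pathA(data[tid]);
    else                      data[tid] = pathB(data[tid]);
}

// Uniform per warp: the condition depends only on the warp index within
// the block, so all lanes of a given warp take the same path.
__global__ void uniformWithinWarp(float* data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    int warpInBlock = threadIdx.x / warpSize;
    if (warpInBlock % 2 == 0) data[tid] = pathA(data[tid]);
    else                      data[tid] = pathB(data[tid]);
}

// Hypothetical host helper: runs one of the kernels on n zero-initialised floats.
std::vector<float> runBranchDemo(bool uniform, int n)
{
    std::vector<float> host(n, 0.0f);
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 64;  // two warps per block
    const int blocks = (n + threads - 1) / threads;
    if (uniform) uniformWithinWarp<<<blocks, threads>>>(dev, n);
    else         divergentWithinWarp<<<blocks, threads>>>(dev, n);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Both kernels are equally "branchy" in source form; the only difference is whether the branch condition varies inside a warp.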

Thanks for all your CUDA help. If you ever come across a young bald Irish geek, ask him if he owes you a beer, 'cause I do. Fortunately the matrix solution is the most complex bit of the thread computation, and I've eliminated all the non-border-case ifs, so warp divergence shouldn't be an issue. I'm looking at Crout decomposition from Numerical Recipes in C, and when I get something working I will update the other question. — Bolster, Apr 14 '11 at 15:29

score 1 · Answer 2 · answered Apr 15 '11 at 20:18

I just want to reiterate what talonmies said: there is no real limit to the "size" of a kernel in number of operations. As long as the computation is parallel, CUDA will be effective!

As far as practical considerations go, I would just add a few small notes:

- Long-running kernels can time out, depending on the OS (or when profiling with cudaProf). You might have to change a setting somewhere to increase the maximum kernel execution time.
- Long-running kernels on systems without a dedicated GPU can freeze the display (interrupting the UI).
- Warps are scheduled independently: one warp can be waiting on memory while another performs arithmetic, which keeps the clock cycles used effectively. Long-running kernels might benefit more from attention to this kind of optimization, though I'm not really sure about this last one.
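One way to find out whether the first two points apply to a given machine is to ask the CUDA runtime whether a run time limit is enforced on kernels for that device. The sketch below reads the `kernelExecTimeoutEnabled` field of `cudaDeviceProp`; the helper name `hasKernelTimeLimit` is an assumption for illustration. It only reports whether a limit exists, not how long it is or how to change it, which depends on the OS and driver.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the given device enforces a run time
// limit on kernels (typically because it also drives a display).
// Returns false if the device properties cannot be queried.
bool hasKernelTimeLimit(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return false;
    return prop.kernelExecTimeoutEnabled != 0;
}
```

Calling `hasKernelTimeLimit(0)` at start-up and logging the result makes it easier to tell a watchdog timeout apart from a genuine bug when a long kernel fails.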

thanks for your comment, but how would one actually assess that without having a successful profile? — Bolster, Apr 16 '11 at 15:09
@andrew-boster What do you mean? You can certainly profile long running kernels, you just have to change the kernel timeout setting in cudaprof — jmilloy, Apr 16 '11 at 15:42
