Crashing a kernel gracefully

Question

A follow up to: CUDA: Stop all other threads

I'm looking for a way to exit a kernel if a "bad condition" occurs. The programming guide says NVCC does not support exception handling. I'm wondering whether there is a user-defined CUDA error code; in other words, if something "bad" happens, the kernel terminates with that user error code. I doubt there is one, so my other idea is to cause an error deliberately.

Something like: if "bad" happens, divide by zero. But I'm unsure whether one thread dividing by zero is enough to crash the whole kernel, or just that thread.

Is there a better approach to terminating a kernel?

If your main use for this is debugging, CUDA has assert support on Fermi and Kepler. It kills your context, but it will give a useful assert message on the way out, or drop you into the code where the assertion failed if you run your app in the debugger. (talonmies, Sep 21 '12 at 04:53)

score 8 · Accepted Answer · edited May 23 '17 at 11:55

You should first read this question and the answers by harrism and tera (asked/answered yesterday).

You may be tempted to use something like

if (there_is_an_error) {
  *status = MY_ERROR_CODE; // store to device pointer
  __threadfence();         // ensure store issued before trap
  asm("trap;");            // kill kernel with error
}

This does not exactly satisfy your condition of "graceful", in my opinion. The trap instruction causes the kernel to exit and the runtime to report cudaErrorUnknown. But since kernel execution is asynchronous, you need to synchronize your stream or device in order to catch this error. That means synchronizing after every kernel call, unless you are OK with imprecise errors (i.e. the error may not surface until some later CUDA API call).
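As a rough host-side sketch of the pattern described above: the kernel stores a code and traps, and the host synchronizes right after the launch to see the failure. The names `checked_kernel`, `run_checked`, the `fail` flag and the value of `MY_ERROR_CODE` are hypothetical. One assumption differs from the snippet above: the status word lives in mapped pinned host memory rather than device memory, because after a trap the context is generally unusable and a `cudaMemcpy` from device memory is likely to fail. Which error code the runtime reports for a trap may vary between toolkit versions, so the sketch only checks for "not `cudaSuccess`".

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MY_ERROR_CODE 42  // hypothetical user-defined error code

// Kernel that traps when the (hypothetical) flag `fail` is set.
__global__ void checked_kernel(bool fail, volatile int *status)
{
    if (fail && threadIdx.x == 0) {
        *status = MY_ERROR_CODE;   // store to mapped host memory
        __threadfence_system();    // order the store before the trap, host-visible
        asm("trap;");              // abort the kernel with an error
    }
}

// Launches the kernel, synchronizes, and returns the runtime error.
// *user_code receives whatever the status word holds afterwards.
cudaError_t run_checked(bool fail, int *user_code)
{
    int *h_status = nullptr;
    int *d_status = nullptr;

    // Assumption: the device supports mapped (zero-copy) pinned memory.
    cudaError_t err = cudaHostAlloc((void **)&h_status, sizeof(int),
                                    cudaHostAllocMapped);
    if (err != cudaSuccess) return err;
    *h_status = 0;

    err = cudaHostGetDevicePointer((void **)&d_status, h_status, 0);
    if (err != cudaSuccess) { cudaFreeHost(h_status); return err; }

    checked_kernel<<<1, 32>>>(fail, d_status);
    err = cudaGetLastError();            // launch-time errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();   // execution errors such as the trap

    *user_code = *h_status;
    if (err == cudaSuccess)
        cudaFreeHost(h_status);          // skipped after a trap: the context is
                                         // assumed to be unusable at that point
    return err;
}
```

Calling `run_checked(true, &code)` from `main` and printing `cudaGetErrorString` of the result shows the generic runtime error; `code` is where the user-defined value would be expected to show up, though that depends on the mapped-memory assumption holding on your platform.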

But this is just the way kernel error handling works in CUDA: well-written code should synchronize in debug builds to check for kernel errors, and settle for imprecise error messages in release builds. Unfortunately, I don't think there is a more graceful way than that.

edit: on compute capability 2.0 and later you can use assert() to exit with an error in debug builds. It was unclear whether this is what you want, though.
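A minimal sketch of the assert() route, assuming a device of compute capability 2.0 or later and a build without `NDEBUG` (defining `NDEBUG` compiles device asserts out, as on the host). The names `assert_kernel` and `run_assert_kernel` and the checked condition are made up for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical kernel: the assertion fails for negative input.
__global__ void assert_kernel(int value)
{
    assert(value >= 0);
}

// Launches the kernel and synchronizes so the failure is reported
// at this call site rather than at some later API call.
cudaError_t run_assert_kernel(int value)
{
    assert_kernel<<<1, 1>>>(value);
    cudaError_t err = cudaGetLastError();   // launch-time errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();      // cudaErrorAssert if the assert fired
    return err;
}
```

When the assertion fires, the runtime prints the file, line and failing expression to stderr and the synchronizing call returns `cudaErrorAssert`. As noted in the comment on the question, the context is dead afterwards, so this is a debugging aid rather than a recoverable error path.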

score 1 · Answer 2 · answered Sep 21 '12 at 07:53

Assertions may help you. They are described in section B.15 of the CUDA C Programming Guide.

sjiagc
