How to Properly Recover from Memory Errors in GPU?

Question

Consumer-grade Nvidia GPUs are expected to have about 1-10 soft memory errors per week.

If you somehow manage to detect an error on a system without ECC (e.g. because the results were abnormal), what steps are necessary and sufficient to recover from it?

Is it enough to just reload all of the data to the GPU (`cuda.memcpy_htod` in PyCUDA), or do you need to reboot the system? And what about the kernel code itself, as opposed to the data?

score 2 · Accepted Answer · answered Sep 17 '13 at 15:43


A soft memory error (meaning incorrect results due to noise of some kind) shouldn't require a reboot. Just rewind to a known-good state, reload the data to the GPU, and proceed.
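A minimal sketch of that rewind-and-reload loop, written in plain Python so it runs without a GPU. The callables `upload`, `step`, and `looks_valid` are hypothetical placeholders, not part of any library: in PyCUDA, `upload` would presumably wrap `cuda.memcpy_htod`, `step` would launch the kernel and copy the result back, and `looks_valid` would be whatever application-specific sanity check flagged the abnormal result.

```python
import copy


def run_with_recovery(initial_state, steps, upload, step, looks_valid, max_retries=3):
    """Run `steps` iterations, rolling back to the last good host copy on a bad result.

    All callables are hypothetical placeholders:
      upload(host_state) -> device_state   (e.g. wraps a host-to-device copy)
      step(device_state) -> host_state     (launch kernel, copy result back)
      looks_valid(host_state) -> bool      (application-specific sanity check)
    """
    # Known-good copy kept in host memory, never on the device.
    checkpoint = copy.deepcopy(initial_state)
    for i in range(steps):
        for _attempt in range(max_retries + 1):
            # Always re-upload from the checkpoint; never reuse device data
            # that may have been corrupted.
            device_state = upload(checkpoint)
            result = step(device_state)
            if looks_valid(result):
                checkpoint = copy.deepcopy(result)
                break
        else:
            raise RuntimeError(f"step {i} still failing after {max_retries} retries")
    return checkpoint
```

The key design choice is that the checkpoint lives in host memory, so a retry never depends on anything still resident on the GPU.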


Levi Barnes


score 1 · Answer 2 · answered Sep 17 '13 at 16:55

Of course, it depends on what was stored in the memory that got corrupted. I have accidentally overwritten GPU memory in ways that required a reboot to fix, so it seems the same could happen when memory is corrupted at random. I think the GPU drivers reside partially in GPU memory.

For critical calculations, one can guard against soft memory errors by running the same calculation twice (including the memory copies, etc.) and comparing the results.
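A rough sketch of that run-twice-and-compare check, again in plain Python. `run_a` and `run_b` are hypothetical callables that each perform the whole pipeline (host-to-device copy, kernel launch, device-to-host copy); they could target the same GPU twice or two different cards. The exact-equality comparison assumes the computation is deterministic, which may not hold for every kernel.

```python
def checked_compute(run_a, run_b, host_input, max_rounds=3):
    """Accept a result only when two independent runs agree.

    run_a / run_b are hypothetical callables that take the host input,
    perform the full transfer-compute-transfer pipeline, and return a
    value that supports == (for example, the raw result bytes).
    """
    for _ in range(max_rounds):
        a = run_a(host_input)
        b = run_b(host_input)
        if a == b:
            return a
        # Mismatch: at least one run was corrupted, so redo both from the host copy.
    raise RuntimeError(f"results still disagree after {max_rounds} rounds")
```

With only two runs, a mismatch shows that something went wrong but not which run was bad, which is why both are repeated rather than one being trusted.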

Since compute cards with ECC are often more than twice as expensive as graphics cards, it may be cheaper to buy two graphics cards, run the same calculations on both, and compare all results. That has the added benefit of letting you double the calculation speed for non-critical work by giving each card different tasks.

Would reloading the driver (`modprobe -r ...; modprobe ...` on Linux) have fixed this as well? — MWB, Sep 17 '13 at 17:40
