Trying to "fix" the problem by throttling the GPUs when you detect overheating is a Bad Idea.
You're operating on the ragged edge of the envelope, and even if you start throttling back at, say, 90 degrees (8 degrees before the "redline" NVIDIA specifies), there's no guarantee you won't overshoot the limits of your cooling (and the hardware's safe operating range).
Down this road lies only misery - in the form of computation errors, hardware damage, and large repair/replacement bills.
Throttling the GPUs can help if you do it early enough.
You could throttle the GPUs all the time, preventing them from ever exceeding their maximum operating temperature. This will save your hardware, but you're permanently trading performance for that thermal headroom.
You could implement this with a PID algorithm that starts throttling the GPUs at, say, 80 degrees, to hold them at or below 90 degrees -- something like the sketch below.
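Here's a minimal sketch of that control loop in Python, assuming `nvidia-smi` is on the PATH and your driver is new enough to lock core clocks with `-lgc`; the setpoint, clock limits, and gains are placeholder numbers you'd tune for your own hardware:

```python
#!/usr/bin/env python3
"""Sketch: hold one GPU at/below a temperature setpoint with a PID loop.

Assumptions: nvidia-smi is on the PATH, the driver supports -lgc
(clock locking), and all constants below are placeholders to tune.
"""
import subprocess
import time

SETPOINT_C = 80.0             # start pushing clocks down here
MIN_CLOCK_MHZ = 600           # hypothetical floor for this card
MAX_CLOCK_MHZ = 1800          # hypothetical ceiling for this card
KP, KI, KD = 60.0, 4.0, 20.0  # placeholder gains -- tune on real hardware
INTERVAL_S = 5                # control loop period, seconds

def read_temp(gpu: int) -> float:
    """Read one GPU's core temperature (degrees C) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"])
    return float(out.decode().strip())

def lock_max_clock(gpu: int, mhz: int) -> None:
    """Clamp the GPU core clock range to [MIN_CLOCK_MHZ, mhz] (needs root)."""
    subprocess.check_call(
        ["nvidia-smi", "-i", str(gpu), "-lgc", f"{MIN_CLOCK_MHZ},{mhz}"])

def main(gpu: int = 0) -> None:
    integral, prev_err = 0.0, 0.0
    while True:
        err = read_temp(gpu) - SETPOINT_C               # positive when too hot
        integral = max(0.0, integral + err * INTERVAL_S)  # no negative windup
        derivative = (err - prev_err) / INTERVAL_S
        prev_err = err
        # The hotter we run over the setpoint, the lower the clock ceiling.
        target = MAX_CLOCK_MHZ - (KP * err + KI * integral + KD * derivative)
        mhz = int(min(MAX_CLOCK_MHZ, max(MIN_CLOCK_MHZ, target)))
        lock_max_clock(gpu, mhz)
        time.sleep(INTERVAL_S)

if __name__ == "__main__":
    main()
```

Clamping the integral term at zero keeps the controller from winding up while the GPU is cool, so the clocks snap back to full speed as soon as there's thermal headroom.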
Presumably, though, you're spending a lot of money on this compute farm -- throttling it kinda defeats the purpose (getting results fast).
Fixing your cooling problem is the only Real Solution.
As the commenters pointed out, your core problem is bad/insufficient cooling.
We don't know WHY you have insufficient cooling, and the right solution depends on the underlying cause (the logging sketch after this list can help you figure out which case you're in):
- If the case has poor airflow, you can add blowers to move a higher volume of air through the system.
- If your datacenter has poor airflow distribution, you can redesign the room (hot/cold aisle separation, for example) so the machines draw cooler intake air.
- If your datacenter is chronically overheated, you may need to add more cooling capacity (however much is necessary to handle your heat load).
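To tell these cases apart, it helps to have data. Here's a small logging sketch (again assuming `nvidia-smi` is available; the filename and interval are arbitrary) that records each GPU's temperature and fan speed so you can compare them against your room's intake-air temperature over a day:

```python
#!/usr/bin/env python3
"""Sketch: log GPU temperature and fan speed to CSV to help spot the
cooling bottleneck. Assumes nvidia-smi is on the PATH."""
import subprocess
import time

def snapshot() -> str:
    """One CSV line per GPU: index, core temp (C), fan speed (%)."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,fan.speed",
         "--format=csv,noheader"]).decode()

with open("gpu_thermals.csv", "a") as log:
    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        for line in snapshot().strip().splitlines():
            log.write(f"{stamp}, {line}\n")
        log.flush()
        time.sleep(60)
```

Fans pinned at 100% while the intake air is cool suggest a case-airflow problem; GPU temperatures that track the room's daily temperature swing point at the datacenter's cooling.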