Trying to "fix" the problem by throttling the GPUs when you detect overheating is a Bad Idea.
You're operating on the ragged edge of the envelope, and even if you start throttling back at, say, 90 degrees (8 degrees before the "redline" NVIDIA specifies), there's no guarantee you won't overshoot the limits of your cooling (and the hardware's safe operating range).
Down this road lies only misery - in the form of computation errors, hardware damage, and large repair/replacement bills.
Throttling the GPUs can help if you do it early enough.
You could throttle the GPUs all the time, preventing them from ever exceeding their maximum operating temperature. This will save your hardware, but you're permanently trading performance for that thermal headroom.
You could implement this with a PID algorithm that starts throttling the GPUs at, say, 80 degrees, to hold them at or below 90 degrees -- something like the sketch below.
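Here's a minimal sketch of that control loop in Python, assuming `nvidia-smi` is on the PATH and your driver is new enough to lock core clocks with `-lgc`; the setpoint, clock limits, and gains are placeholder numbers you'd tune for your own hardware:

```python
#!/usr/bin/env python3
"""Sketch: hold one GPU at/below a temperature setpoint with a PID loop.

Assumptions: nvidia-smi is on the PATH, the driver supports -lgc
(clock locking), and all constants below are placeholders to tune.
"""
import subprocess
import time

SETPOINT_C = 80.0             # start pushing clocks down here
MIN_CLOCK_MHZ = 600           # hypothetical floor for this card
MAX_CLOCK_MHZ = 1800          # hypothetical ceiling for this card
KP, KI, KD = 60.0, 4.0, 20.0  # placeholder gains -- tune on real hardware
INTERVAL_S = 5                # control loop period, seconds

def read_temp(gpu: int) -> float:
    """Read one GPU's core temperature (degrees C) via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "-i", str(gpu),
         "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"])
    return float(out.decode().strip())

def lock_max_clock(gpu: int, mhz: int) -> None:
    """Clamp the GPU core clock range to [MIN_CLOCK_MHZ, mhz] (needs root)."""
    subprocess.check_call(
        ["nvidia-smi", "-i", str(gpu), "-lgc", f"{MIN_CLOCK_MHZ},{mhz}"])

def main(gpu: int = 0) -> None:
    integral, prev_err = 0.0, 0.0
    while True:
        err = read_temp(gpu) - SETPOINT_C               # positive when too hot
        integral = max(0.0, integral + err * INTERVAL_S)  # no negative windup
        derivative = (err - prev_err) / INTERVAL_S
        prev_err = err
        # The hotter we run over the setpoint, the lower the clock ceiling.
        target = MAX_CLOCK_MHZ - (KP * err + KI * integral + KD * derivative)
        mhz = int(min(MAX_CLOCK_MHZ, max(MIN_CLOCK_MHZ, target)))
        lock_max_clock(gpu, mhz)
        time.sleep(INTERVAL_S)

if __name__ == "__main__":
    main()
```

Clamping the integral term at zero keeps the controller from winding up while the GPU is cool, so the clocks snap back to full speed as soon as there's thermal headroom.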
Presumably, though, you're spending a lot of money on this compute farm -- throttling it kinda defeats the purpose (getting results fast).
Fixing your cooling problem is the only Real Solution.
As the commenters pointed out, your core problem is bad/insufficient cooling.
We don't know WHY you have insufficient cooling, and the right solution depends on the underlying cause (the logging sketch after this list can help you figure out which case you're in):
- If the case has poor airflow, you can add blowers to move a higher volume of air through the system.
- If your datacenter has poor airflow distribution, you can redesign the room (hot/cold aisle separation, for example) so the machines draw cooler intake air.
- If your datacenter is chronically overheated, you may need to add more cooling capacity (however much is necessary to handle your heat load).
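To tell these cases apart, it helps to have data. Here's a small logging sketch (again assuming `nvidia-smi` is available; the filename and interval are arbitrary) that records each GPU's temperature and fan speed so you can compare them against your room's intake-air temperature over a day:

```python
#!/usr/bin/env python3
"""Sketch: log GPU temperature and fan speed to CSV to help spot the
cooling bottleneck. Assumes nvidia-smi is on the PATH."""
import subprocess
import time

def snapshot() -> str:
    """One CSV line per GPU: index, core temp (C), fan speed (%)."""
    return subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,temperature.gpu,fan.speed",
         "--format=csv,noheader"]).decode()

with open("gpu_thermals.csv", "a") as log:
    while True:
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        for line in snapshot().strip().splitlines():
            log.write(f"{stamp}, {line}\n")
        log.flush()
        time.sleep(60)
```

Fans pinned at 100% while the intake air is cool suggest a case-airflow problem; GPU temperatures that track the room's daily temperature swing point at the datacenter's cooling.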