
I have 4 NVIDIA 1080 GPUs (11GB each), 128GB RAM, and a 1600W EVGA SuperNOVA P2 power supply in my lab. I am new to deep learning and want to get a sense of what normal hardware behaviour looks like during training.

I have 70,000 medical images of size 256x256x3, and I am doing end-to-end training with AlexNet.

If I set the batch size to anything more than 18 using 3 of my GPUs, the computer powers down and then restarts. GPU-Burn runs fine on all GPUs, and with batch sizes of 4-8 I can use all 4 GPUs. Despite all this, the GPU temperatures sit at 70-75°C with no more than 60% utilisation on each of the 3 GPUs.

Is this normal? I would have thought I could train with considerably larger batches on this hardware.

Thanks.

GhostRider

1 Answer


That looks like a hardware problem, but also check the various logs (the output of dmesg, the files under /var/log/).

Perhaps your PSU is slightly undersized: a sudden power-off and restart under load is a classic symptom of the power supply tripping its protection.

Perhaps your cooling is insufficient and your computer is getting too hot. Is it sitting in an air-conditioned room?

NVIDIA GPUs can get quite hot under sustained load.

If you have an ordinary desktop case, try removing a side panel to lower the temperature slightly (and perhaps open the window, if it is winter and cold enough outside). Check that your fans are working properly (there may be relevant BIOS settings).

Also use some utilities (yacpi, xsensors, nvidia-smi, etc.) to measure the temperature at several points (GPU, CPU, case, motherboard, whatever you can).
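For example, here is a minimal monitoring sketch using NVML (the library behind nvidia-smi, which ships with the NVIDIA driver); the file name and the one-minute polling loop are just illustrative. Compile it with nvcc and link against libnvidia-ml.

```cpp
// temp_monitor.cu -- minimal NVML polling sketch (assumes nvml.h / libnvidia-ml are installed)
// Build: nvcc temp_monitor.cu -lnvidia-ml -o temp_monitor
#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {                // initialise the NVML library
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);                      // number of GPUs visible to the driver

    for (int iter = 0; iter < 60; ++iter) {          // poll once per second for a minute
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            unsigned int temp = 0, power = 0;
            nvmlUtilization_t util = {0, 0};
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);  // degrees C
            nvmlDeviceGetPowerUsage(dev, &power);                        // milliwatts
            nvmlDeviceGetUtilizationRates(dev, &util);                   // percent
            printf("GPU %u: %u C, %.1f W, %u%% util\n",
                   i, temp, power / 1000.0, util.gpu);
        }
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```

Run something like this in a second terminal while training, so you can see the temperature and power draw of each GPU at the moment the machine shuts off.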

Also run some GPU benchmarks (or write a simple one in CUDA or OpenCL) to load your GPU hardware, and be sure to check for failure of every GPU-related call.
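If you want to roll your own load test, a minimal sketch could look like the one below (the kernel, grid size and round count are arbitrary choices, not a tuned benchmark): it keeps every GPU busy at the same time and checks the return code of every CUDA call, so a failing card or driver shows up as an explicit error instead of a silent crash.

```cpp
// gpu_stress.cu -- tiny multi-GPU load test with error checking on every CUDA call
// Build: nvcc -O2 gpu_stress.cu -o gpu_stress
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Busy-loop kernel: many fused multiply-adds per thread to draw sustained power.
__global__ void burn(float *out, int iters) {
    float x = threadIdx.x * 0.001f + 1.0f;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.0000001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;   // keep the result so it is not optimised away
}

int main() {
    int ndev = 0;
    CHECK(cudaGetDeviceCount(&ndev));
    if (ndev > 16) ndev = 16;                         // this sketch only handles up to 16 GPUs
    const int blocks = 4096, threads = 256;
    float *buf[16] = {nullptr};                       // one output buffer per GPU

    for (int d = 0; d < ndev; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(&buf[d], blocks * threads * sizeof(float)));
    }
    for (int round = 0; round < 100; ++round) {
        for (int d = 0; d < ndev; ++d) {              // launch on all GPUs at once,
            CHECK(cudaSetDevice(d));                  // so they draw power simultaneously
            burn<<<blocks, threads>>>(buf[d], 1 << 20);
            CHECK(cudaGetLastError());                // launch failure?
        }
        for (int d = 0; d < ndev; ++d) {              // then wait for all of them
            CHECK(cudaSetDevice(d));
            CHECK(cudaDeviceSynchronize());           // runtime failure shows up here
        }
        printf("round %d OK on %d GPU(s)\n", round, ndev);
    }
    return 0;
}
```

If the machine powers off while this runs but survives single-GPU runs, that points at the combined power draw (PSU or wiring) rather than at any individual card.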

Basile Starynkevitch