
I have 4 NVIDIA 1080 GPUs (11GB each), 128GB RAM, and a 1600W EVGA SuperNOVA P2 power supply in my lab. I am new to deep learning and want to get a sense of what normal hardware behaviour looks like during training.

I have 70,000 medical images of size 256x256x3, and I am doing end-to-end training with AlexNet.

If I set the batch size to anything more than 18 using 3 of my GPUs, the computer powers down and then restarts. GPU-Burn runs fine on all GPUs, and with batch sizes of 4-8 I can use all 4 GPUs. Despite all this, the GPU temperatures sit at 70-75°C with no more than 60% utilisation on each of the 3 GPUs.

Is this normal? I would have thought I could train with considerably larger batches on this hardware.

Thanks.

GhostRider

1 Answer


That looks like a hardware problem, but also check the various logs (the output of dmesg, the files under /var/log/).

Perhaps your PSU is slightly undersized: a sudden power-off and restart under load is a classic symptom of the power supply tripping its protection.

Perhaps your cooling is insufficient and your computer is getting too hot. Is it sitting in an air-conditioned room?

NVIDIA GPUs can get quite hot under sustained load.

If you have an ordinary desktop case, try removing a side panel to lower the temperature slightly (and perhaps open the window, if it is winter and cold enough outside). Check that your fans are working properly (there may be relevant BIOS settings).

Also use some utilities (yacpi, xsensors, nvidia-smi, etc.) to measure the temperature at several points (GPU, CPU, case, motherboard, whatever you can).
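For example, here is a minimal monitoring sketch using NVML (the library behind nvidia-smi, which ships with the NVIDIA driver); the file name and the one-minute polling loop are just illustrative. Compile it with nvcc and link against libnvidia-ml.

```cpp
// temp_monitor.cu -- minimal NVML polling sketch (assumes nvml.h / libnvidia-ml are installed)
// Build: nvcc temp_monitor.cu -lnvidia-ml -o temp_monitor
#include <nvml.h>
#include <cstdio>
#include <unistd.h>

int main() {
    if (nvmlInit() != NVML_SUCCESS) {                // initialise the NVML library
        fprintf(stderr, "nvmlInit failed\n");
        return 1;
    }
    unsigned int count = 0;
    nvmlDeviceGetCount(&count);                      // number of GPUs visible to the driver

    for (int iter = 0; iter < 60; ++iter) {          // poll once per second for a minute
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            unsigned int temp = 0, power = 0;
            nvmlUtilization_t util = {0, 0};
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);  // degrees C
            nvmlDeviceGetPowerUsage(dev, &power);                        // milliwatts
            nvmlDeviceGetUtilizationRates(dev, &util);                   // percent
            printf("GPU %u: %u C, %.1f W, %u%% util\n",
                   i, temp, power / 1000.0, util.gpu);
        }
        sleep(1);
    }
    nvmlShutdown();
    return 0;
}
```

Run something like this in a second terminal while training, so you can see the temperature and power draw of each GPU at the moment the machine shuts off.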

Also run some GPU benchmarks (or write a simple one in CUDA or OpenCL) to load your GPU hardware, and be sure to check for failure of every GPU-related call.
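If you want to roll your own load test, a minimal sketch could look like the one below (the kernel, grid size and round count are arbitrary choices, not a tuned benchmark): it keeps every GPU busy at the same time and checks the return code of every CUDA call, so a failing card or driver shows up as an explicit error instead of a silent crash.

```cpp
// gpu_stress.cu -- tiny multi-GPU load test with error checking on every CUDA call
// Build: nvcc -O2 gpu_stress.cu -o gpu_stress
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err = (call);                                          \
        if (err != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                    \
                    cudaGetErrorString(err), __FILE__, __LINE__);          \
            return 1;                                                      \
        }                                                                  \
    } while (0)

// Busy-loop kernel: many fused multiply-adds per thread to draw sustained power.
__global__ void burn(float *out, int iters) {
    float x = threadIdx.x * 0.001f + 1.0f;
    for (int i = 0; i < iters; ++i)
        x = x * 1.0000001f + 0.0000001f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;   // keep the result so it is not optimised away
}

int main() {
    int ndev = 0;
    CHECK(cudaGetDeviceCount(&ndev));
    if (ndev > 16) ndev = 16;                         // this sketch only handles up to 16 GPUs
    const int blocks = 4096, threads = 256;
    float *buf[16] = {nullptr};                       // one output buffer per GPU

    for (int d = 0; d < ndev; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaMalloc(&buf[d], blocks * threads * sizeof(float)));
    }
    for (int round = 0; round < 100; ++round) {
        for (int d = 0; d < ndev; ++d) {              // launch on all GPUs at once,
            CHECK(cudaSetDevice(d));                  // so they draw power simultaneously
            burn<<<blocks, threads>>>(buf[d], 1 << 20);
            CHECK(cudaGetLastError());                // launch failure?
        }
        for (int d = 0; d < ndev; ++d) {              // then wait for all of them
            CHECK(cudaSetDevice(d));
            CHECK(cudaDeviceSynchronize());           // runtime failure shows up here
        }
        printf("round %d OK on %d GPU(s)\n", round, ndev);
    }
    return 0;
}
```

If the machine powers off while this runs but survives single-GPU runs, that points at the combined power draw (PSU or wiring) rather than at any individual card.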

Basile Starynkevitch