
I am trying to run StyleGAN2 using a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:

minibatch_size_dict     = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32},       # Resolution-specific overrides.
minibatch_gpu_base      = 8,        # Number of samples processed at a time by one GPU.
minibatch_gpu_dict      = {},       # Resolution-specific overrides.
G_lrate_base            = 0.001,    # Learning rate for the generator.
G_lrate_dict            = {},       # Resolution-specific overrides.
D_lrate_base            = 0.001,    # Learning rate for the discriminator.
D_lrate_dict            = {},       # Resolution-specific overrides.
lrate_rampup_kimg       = 0,        # Duration of learning rate ramp-up.
tick_kimg_base          = 4,        # Default interval of progress snapshots.
tick_kimg_dict          = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.

I am training on a set of 512x512 pixel images. After a couple of iterations, I get the error message reported below and the script appears to stop running (using watch nvidia-smi, I can see that both the temperature and the fan activity of the GPUs decrease). I already reduced the batch size, but it looks like the problem is somewhere else. Do you have any tips on how to fix this?

I was able to run the original StyleGAN with the same dataset. In the paper they say that StyleGAN2 should be lighter to train, so I am a bit surprised.

Here is the error message I get:

2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352).  Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
albus_c
  • I'm seeing the same type of error; I've put my notes here if you want to see what I've tried so far: https://datascience.stackexchange.com/questions/74666/is-it-possible-to-train-stylegan2-with-a-custom-dataset-using-a-graphics-card-th – slim May 23 '20 at 17:07
  • Did you recently downgrade from tensorflow 2.0 to version 1.15? – slim May 23 '20 at 20:17
  • What's your minibatch_size_base? – slim May 24 '20 at 13:24

2 Answers


The config-f model for StyleGAN2 is actually bigger than StyleGAN1's. Try a configuration that uses less VRAM, such as config-e. You can change the model configuration by passing the --config flag on the command line, as defined here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
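
For example, you can adapt the training command from the StyleGAN2 README and just swap the config (the dataset name and data directory below are placeholders for your own):

python run_training.py --num-gpus=8 --data-dir=~/datasets --dataset=my_dataset --config=config-e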

In my case, I'm able to train StyleGAN2 with config-e on 2 RTX 2080ti.

Syzygy

From the StyleGAN2 README's requirements: "One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit and cuDNN 7.5. To reproduce the results reported in the paper, you need an NVIDIA GPU with at least 16 GB of DRAM."

Your NVIDIA GeForce RTX 2080 cards have 8 GB of VRAM each (it's the 2080 Ti that has 11 GB), so you're well under that 16 GB figure. I guess you're saying you have 8 of them, but that doesn't help with this particular error: multi-GPU training splits the batch across cards, while each GPU still needs enough memory for its own copy of the model, so the per-GPU VRAM is the limit.
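
If you want to double-check how much memory TensorFlow can actually use on each card, here is a minimal sketch (assuming the TensorFlow 1.x that StyleGAN2 requires; note that running it will briefly claim the GPUs):

# List the GPUs visible to TensorFlow and the memory it can allocate on each.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == 'GPU':
        print(dev.name, '%.1f GB' % (dev.memory_limit / 1024 ** 3))

If each card reports well under the 16 GB mentioned above, hitting OOM at 512x512 with the default config is not surprising, and lowering minibatch_gpu_base or switching to a smaller config (as in the other answer) is the usual workaround.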

slim