
I am trying to run StyleGAN2 using a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:

minibatch_size_dict     = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32},       # Resolution-specific overrides.
minibatch_gpu_base      = 8,        # Number of samples processed at a time by one GPU.
minibatch_gpu_dict      = {},       # Resolution-specific overrides.
G_lrate_base            = 0.001,    # Learning rate for the generator.
G_lrate_dict            = {},       # Resolution-specific overrides.
D_lrate_base            = 0.001,    # Learning rate for the discriminator.
D_lrate_dict            = {},       # Resolution-specific overrides.
lrate_rampup_kimg       = 0,        # Duration of learning rate ramp-up.
tick_kimg_base          = 4,        # Default interval of progress snapshots.
tick_kimg_dict          = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.

I am training on a set of 512x512 pixel images. After a couple of iterations, I get the error message reported below and the script appears to stop running (using watch nvidia-smi, I can see that both the temperature and the fan activity of the GPUs decrease). I already reduced the batch size, but it looks like the problem is somewhere else. Do you have any tips on how to fix this?

I was able to run the original StyleGAN with the same dataset. In the paper they say that StyleGAN2 should be lighter to train, so I am a bit surprised.

Here is the error message I get:

2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352).  Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
albus_c
  • I'm seeing the same type of error; I've put my notes here if you want to see what I've tried so far: https://datascience.stackexchange.com/questions/74666/is-it-possible-to-train-stylegan2-with-a-custom-dataset-using-a-graphics-card-th – slim May 23 '20 at 17:07
  • Did you recently downgrade from tensorflow 2.0 to version 1.15? – slim May 23 '20 at 20:17
  • What's your minibatch_size_base? – slim May 24 '20 at 13:24

2 Answers


The config-f model for StyleGAN2 is actually bigger than StyleGAN1's. Try a configuration that uses less VRAM, such as config-e. You can change the model configuration by passing the --config flag on the command line, as defined here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
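
For example, you can adapt the training command from the StyleGAN2 README and just swap the config (the dataset name and data directory below are placeholders for your own):

python run_training.py --num-gpus=8 --data-dir=~/datasets --dataset=my_dataset --config=config-e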

In my case, I'm able to train StyleGAN2 with config-e on 2 RTX 2080ti.

Syzygy

From the StyleGAN2 README's requirements: "One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit and cuDNN 7.5. To reproduce the results reported in the paper, you need an NVIDIA GPU with at least 16 GB of DRAM."

Your NVIDIA GeForce RTX 2080 cards have 8 GB of VRAM each (it's the 2080 Ti that has 11 GB), so you're well under that 16 GB figure. I guess you're saying you have 8 of them, but that doesn't help with this particular error: multi-GPU training splits the batch across cards, while each GPU still needs enough memory for its own copy of the model, so the per-GPU VRAM is the limit.
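
If you want to double-check how much memory TensorFlow can actually use on each card, here is a minimal sketch (assuming the TensorFlow 1.x that StyleGAN2 requires; note that running it will briefly claim the GPUs):

# List the GPUs visible to TensorFlow and the memory it can allocate on each.
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    if dev.device_type == 'GPU':
        print(dev.name, '%.1f GB' % (dev.memory_limit / 1024 ** 3))

If each card reports well under the 16 GB mentioned above, hitting OOM at 512x512 with the default config is not surprising, and lowering minibatch_gpu_base or switching to a smaller config (as in the other answer) is the usual workaround.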

slim