
I am training a ResNet-50 network on a large dataset. When checking my GPU utilization, I found it varying only between 0% and 4%, even though I am using tensorflow-gpu. Here is my CPU and GPU utilization:

[screenshot: CPU and GPU utilization]

When I run these two lines of Python:

 from tensorflow.python.client import device_lib
 print(device_lib.list_local_devices())

I get:

    [name: "/device:CPU:0"
    device_type: "CPU"
    memory_limit: 268435456
    locality {
    }
    incarnation: 4622338339054789933
    , name: "/device:GPU:0"
    device_type: "GPU"
    memory_limit: 13594420839
    locality {
      bus_id: 1
      links {
      }
    }
    incarnation: 17927686236275886371
    physical_device_desc: "device: 0, name: Quadro P5000, pci bus id: 0000:01:00.0, compute capability: 6.1"
    ]

and when I run nvidia-smi I get:

[screenshot: nvidia-smi output]
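For what it's worth, here is a minimal way to confirm that ops are actually placed on the GPU, and not only that the device is visible (a sketch, assuming TensorFlow 1.x, which the fit_generator call below implies):

    import tensorflow as tf

    # Log the device each op is placed on; look for "/device:GPU:0" lines in the console
    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        a = tf.constant([1.0, 2.0, 3.0], name='a')
        b = tf.constant([4.0, 5.0, 6.0], name='b')
        print(sess.run(a + b))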

Could anyone give me a clear explanation of how to correctly and fully exploit my GPU? I should mention that I am using an ImageDataGenerator during training, with its flow_from_directory and fit_generator methods. Could I set specific parameters, such as the workers parameter, to improve my GPU utilization? Here is how I am using ImageDataGenerator (a sketch with those parameters follows the code):

    input_imgen = ImageDataGenerator()

    train_it = input_imgen.flow_from_directory(directory=data_path_l,
                                               target_size=(224, 224),
                                               color_mode="rgb",
                                               batch_size=batch_size,
                                               class_mode="categorical",
                                               shuffle=False)

    valid_it = input_imgen.flow_from_directory(directory=test_data_path_l,
                                               target_size=(224, 224),
                                               color_mode="rgb",
                                               batch_size=batch_size,
                                               class_mode="categorical",
                                               shuffle=False)

    model = resnet.ResnetBuilder.build_resnet_50((img_channels, img_rows, img_cols),
                                                 num_classes)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    # raw string so the backslashes are not treated as escape sequences
    filepath = r".\conv2D_models\weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"

    mc = ModelCheckpoint(filepath, save_weights_only=False, verbose=1,
                         monitor='loss', mode='min')

    history = model.fit_generator(train_it,
                                  steps_per_epoch=train_images // batch_size,
                                  validation_data=valid_it,
                                  validation_steps=val_images // batch_size,
                                  epochs=epochs,
                                  callbacks=[mc],
                                  shuffle=False)
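On the workers question: fit_generator accepts max_queue_size, workers, and use_multiprocessing, which control how many batches are prepared in parallel on the CPU while the GPU trains. A sketch of the same call with these knobs (the values are illustrative, not tuned):

    # Same call as above, with the generator-parallelism parameters exposed
    history = model.fit_generator(train_it,
                                  steps_per_epoch=train_images // batch_size,
                                  validation_data=valid_it,
                                  validation_steps=val_images // batch_size,
                                  epochs=epochs,
                                  callbacks=[mc],
                                  max_queue_size=32,          # batches buffered ahead of the model
                                  workers=8,                  # parallel data-loading workers
                                  use_multiprocessing=False)  # True spawns processes instead of threads

Since flow_from_directory returns a Sequence-based iterator, workers > 1 should be safe here.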
  • The most likely reason is that the model is too small and does not really need GPU training. Only large models require lots of compute and profit from using a GPU. – Dr. Snoopy Jun 07 '19 at 14:48
  • Possible reasons: https://stackoverflow.com/questions/53887816/why-tensorflow-gpu-is-still-using-cpu – cho_uc Jun 07 '19 at 14:56
  • Do you think that a ResNet-50 with 3D convolution layers, trained on a large dataset (20K images) with a batch size of 32, does not require a GPU? – E.gh Jun 07 '19 at 14:57
  • How do you use `ImageDataGenerator`? – Sharky Jun 07 '19 at 15:01
  • @Sharky Please take a look at my question edit. – E.gh Jun 07 '19 at 15:14
  • Do you have any preprocessing inside ImageDataGenerator? – Sharky Jun 07 '19 at 15:57
  • What batch size are you using? – Dr. Snoopy Jun 07 '19 at 16:25
  • @Sharky No, I didn't use any preprocessing inside the generator. – E.gh Jun 07 '19 at 18:06
  • @MatiasValdenegro I used a batch size of 32. – E.gh Jun 07 '19 at 18:07
  • Seems like your model processes data faster than you generate it. Have you considered using the `tf.data` API? – Sharky Jun 07 '19 at 18:09
  • @Sharky No, I don't know this API. How could it help me here? – E.gh Jun 07 '19 at 18:22
  • Please take a look at https://www.tensorflow.org/guide/performance/datasets – Sharky Jun 07 '19 at 18:24 (see the sketch after this thread)
  • @Sharky Thank you, I will read this. – E.gh Jun 07 '19 at 18:38
  • Increase the batch size; that should increase utilization. Try something like 256. – Dr. Snoopy Jun 07 '19 at 20:20
  • @MatiasValdenegro The problem is that when I increase the batch size beyond 32, with a (224, 224) target size, I get a memory error, although I have 32 GB of RAM. – E.gh Jun 07 '19 at 21:29
  • Your hard drive utilization is sitting at a very high percentage; this is probably the culprit. Work to cache your data in memory. No GPU or `tf.data` pipeline can save you if you simply can't get the images off your disk fast enough. – Yolo Swaggins Jun 08 '19 at 04:28
  • @YoloSwaggins I think this is due to the training process, since the utilization screenshot was taken during training. When no training is running on the machine, the hard drive utilization is only 32%. – E.gh Jun 08 '19 at 18:29
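Following the tf.data and caching suggestions in the comments above, here is a minimal input-pipeline sketch (assuming TF 1.13+; filenames, labels, and batch_size are hypothetical placeholders, not variables from the code above):

    import tensorflow as tf

    # filenames: list of image paths, labels: matching integer class ids (hypothetical)
    def parse_fn(path, label):
        img = tf.io.read_file(path)
        img = tf.image.decode_jpeg(img, channels=3)
        img = tf.image.resize_images(img, (224, 224))
        return img, label

    dataset = (tf.data.Dataset.from_tensor_slices((filenames, labels))
               .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
               .cache()       # hold decoded images in RAM after epoch 1 (~12 GB for 20K 224x224 float images)
               .shuffle(buffer_size=1000)
               .batch(batch_size)
               .prefetch(tf.data.experimental.AUTOTUNE))  # overlap input prep with GPU compute

The cache() keeps decoded images in memory after the first epoch, which addresses the disk-throughput concern, and prefetch overlaps CPU-side preparation with GPU compute; tf.keras models can consume such a dataset directly in fit.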
