Not seeing performance improvement when running TensorFlow on GPU

Question

I installed CUDA and cuDNN as per the instructions on the TF help page, and everything appears to be working correctly. If I print the available GPUs I get:

>>> print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

Out: Num GPUs Available:  1

Also, when I start training the Sequential model, the output shows that all the necessary libraries have loaded correctly and that a GPU device was successfully created:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4733 MB memory) -> physical GPU (device: 0, name: GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)

But I'm not seeing any major improvements in training performance. It's about the same as it was before when training on the CPU and I'd assume that my RTX 3060 should provide a bit of a boost.

Should I be seeing an improvement when training a relatively simple Sequential model?

EDIT: If I disable GPU training and train on CPU only using:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

The training time of the model on CPU is 21.14 seconds, whereas on GPU the training takes 57.59 (!!!) seconds.

I also don't see the GPU load increase as expected during training.

Here is the code for the model I'm training:

import datetime as dt
# import os
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import tensorflow as tf
from tensorflow import keras
import numpy as np

EPOCHS = 50
BATCH_SIZE = 128
VERBOSE = 1
NB_CLASSES = 10  # Number of outputs
N_HIDDEN = 128
VALIDATION_SPLIT = 0.2
DROPOUT = 0.3

mnist = keras.datasets.mnist
(X_train, Y_train), (X_test, Y_test) = mnist.load_data()
# X_train is 60,000 rows of 28x28 values
# Reshape it to 60,000x784
RESHAPED = 784

print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

X_train = X_train.reshape(60000, RESHAPED)
X_test = X_test.reshape(10000, RESHAPED)
X_train = X_train.astype('float32')
X_test = X_test.astype('float32')

# Normalize inputs between 0 and 1
X_train /= 255
X_test /= 255
print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')

# One-hot encoding of labels
Y_train = tf.keras.utils.to_categorical(Y_train, NB_CLASSES)
Y_test = tf.keras.utils.to_categorical(Y_test, NB_CLASSES)

# Build the model
model = tf.keras.models.Sequential()
model.add(keras.layers.Dense(N_HIDDEN, input_shape=(RESHAPED,),
          name='dense_layer', activation='relu'))
model.add(keras.layers.Dropout(DROPOUT))
# Only the first layer needs input_shape; later layers infer it
model.add(keras.layers.Dense(N_HIDDEN,
          name='dense_layer2', activation='relu'))
model.add(keras.layers.Dropout(DROPOUT))
model.add(keras.layers.Dense(NB_CLASSES,
          name='dense_layer3', activation='softmax'))

# Print summary of the model
model.summary()

# Compiling the model
model.compile(optimizer='Adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

t = dt.datetime.now()
# Training the model
model.fit(X_train, Y_train, batch_size=BATCH_SIZE,
          epochs=EPOCHS, verbose=VERBOSE,
          validation_split=VALIDATION_SPLIT)

# Evaluate the model
test_loss, test_acc = model.evaluate(X_test, Y_test)
print('Test accuracy: ', test_acc)
print(f'Training elapsed: {dt.datetime.now()-t}')

Can you share the code (model, training loop) as well as the timing result on CPU and GPU? — Louis Lac, May 01 '21 at 09:34
@LouisLac, amended the question. It's actually *significantly slower* on GPU. No idea why. — NotAName, May 01 '21 at 09:41
did you check this one out? : https://stackoverflow.com/questions/42097115/keras-tensorflow-backend-slower-on-gpu-than-on-cpu-when-training-certain-netwo — aSaffary, May 01 '21 at 09:51
Concerning the GPU load, it's important to understand that GPUs perform at their best when models are **deep** and data is complex (MNIST is just 28x28 images). So I would guess that because the MNIST example is quite simple, GPU resources are barely used at all. Plus, to check whether the GPU is actually used, I would recommend this: `tf.test.is_gpu_available()` — Skaddd, May 01 '21 at 09:52
@Skaddd, ok so for simple networks with low-dimensional inputs it's best to disable GPU. Got it. — NotAName, May 01 '21 at 09:55
For MNIST, you need to use a big enough model to keep the GPU busy. — Innat, May 01 '21 at 11:11

score 1 · Answer 1 · answered May 01 '21 at 14:58

I'll just put an answer here in case it is useful to anyone in the future. From the information provided in the comments and also the answer to this post, the slowness appears to result from a combination of a couple of factors.

For starters, on small matrices, matrix multiplication on the CPU is significantly faster due to its higher clock speed. Secondly, there's a significant overhead in transferring data between the CPU and the GPU, and on smaller inputs any performance gains from GPU processing are eaten up by that overhead.
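To see both effects on your own machine, a rough micro-benchmark along these lines may help. It is only a sketch: the helper name `time_matmul`, the matrix sizes (64 and 1024) and the repeat count are my own arbitrary choices, and the numbers it prints will differ between machines. It times `tf.matmul` on every CPU and GPU device TensorFlow can see, so it also runs on a CPU-only setup.

```python
import time
import tensorflow as tf


def time_matmul(device_name, size, repeats=10):
    """Average seconds per matmul of two size x size matrices on one device."""
    with tf.device(device_name):
        a = tf.random.uniform((size, size))
        b = tf.random.uniform((size, size))
        tf.matmul(a, b).numpy()  # warm-up; .numpy() waits for the result
        start = time.perf_counter()
        for _ in range(repeats):
            result = tf.matmul(a, b)
        result.numpy()  # block until the queued work has finished
        elapsed = time.perf_counter() - start
    return elapsed / repeats


# Every CPU/GPU device TensorFlow exposes, e.g. '/device:CPU:0'
device_names = [d.name for d in tf.config.list_logical_devices()
                if d.device_type in ("CPU", "GPU")]

timings = {}
for size in (64, 1024):  # arbitrary "small" and "large" sizes
    for name in device_names:
        timings[(name, size)] = time_matmul(name, size)
        print(f"{name}  {size}x{size}: "
              f"{timings[(name, size)] * 1e3:.3f} ms per matmul")
```

If the explanation above holds for your hardware, the GPU should only pull ahead at the larger size.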

As a result, on the MNIST dataset, where the input has shape (784,), the processing times are as follows:

CPU - 21s

GPU - 57s
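To put those totals in perspective, here is a back-of-the-envelope calculation using the settings from the question's script (60,000 training images, `validation_split=0.2`, batch size 128, 50 epochs). Note that the measured times also include the per-epoch validation passes and the final `evaluate` call, so the per-step figures are rough upper bounds, not exact measurements.

```python
import math

# Settings taken from the script in the question
TRAIN_SAMPLES = 60000
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 128
EPOCHS = 50
LAYER_SIZES = [784, 128, 128, 10]  # input, two hidden layers, output

# Samples actually used for fitting after the validation split
fit_samples = TRAIN_SAMPLES - round(TRAIN_SAMPLES * VALIDATION_SPLIT)
steps_per_epoch = math.ceil(fit_samples / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

# Trainable parameters: weights plus biases of each Dense layer
param_count = sum(n_in * n_out + n_out
                  for n_in, n_out in zip(LAYER_SIZES, LAYER_SIZES[1:]))

# Wall-clock totals reported above, in seconds
total_seconds = {"CPU": 21, "GPU": 57}
ms_per_step = {dev: secs / total_steps * 1000
               for dev, secs in total_seconds.items()}

print(f"{total_steps} steps, {param_count} parameters")
for dev, ms in ms_per_step.items():
    print(f"{dev}: {ms:.2f} ms per step")
```

That works out to 18,750 steps over a model with roughly 118 thousand parameters, at about 1 ms per step on the CPU and about 3 ms on the GPU. Each step does so little arithmetic that fixed per-step costs can easily dominate.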

At the same time, on the IMDB dataset, where the input has shape (10000,), the gains from GPU processing are significant:

CPU - 4min 40s

GPU - 1min 23s

So for small inputs it's best to disable GPU processing for faster fitting of the model, using something like:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
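
The environment variable has to be set before TensorFlow initialises CUDA, which is why it goes ahead of the `tensorflow` import in the question's script. As an alternative (not something from the comments, just another option I'm aware of), TensorFlow can be told from within Python to hide the GPUs. A minimal sketch, assuming it runs before any tensor or model has been created:

```python
import tensorflow as tf

# Hide every GPU from TensorFlow. This must run before any op or tensor
# has initialised the devices, otherwise TensorFlow raises a RuntimeError.
tf.config.set_visible_devices([], 'GPU')

visible_gpus = tf.config.get_visible_devices('GPU')
print("Visible GPUs:", visible_gpus)
```

With the GPUs hidden, `model.fit` falls back to the CPU without any other change to the script.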
