Recently I've been fiddling around with TensorFlow 2 again. I hadn't used it in 2-3 years, but I know my GPU works with it. Most recently I had been working with PyTorch, and when I compared my machine against someone else's without a GPU running on Colab, it was night and day: my machine was so much faster.
But for some reason, as I've been running some small tests, the training has felt too slow. And when I checked the speed by switching the device to CPU, it was the same.
Some context about my machine: I set up a new conda environment to mirror the one recommended for the TensorFlow Developer Exam, so I'm running TF 2.9.0 with Python 3.8.0 on a GeForce RTX 2060, under Windows 10. I did not re-download or update my CUDA libraries from a few years ago, but I checked and TensorFlow recognizes my GPU.
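To at least rule out an obvious mismatch, I also printed the CUDA/cuDNN versions this TF build was compiled against. I believe tf.sysconfig.get_build_info() is the right call for this, though I'm not sure it tells you whether the locally installed CUDA actually matches:
import tensorflow as tf

# Versions this TF wheel was built against (not necessarily what is installed locally)
build = tf.sysconfig.get_build_info()
print(build.get("cuda_version"), build.get("cudnn_version"), build.get("is_cuda_build"))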
Here is the code for loading TensorFlow and checking for the GPU:
import tensorflow as tf
from tensorflow.python.client import device_lib
print(tf.__version__)
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
device_lib.list_local_devices()
And the result:
2.9.0
Num GPUs Available: 1
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 18213716215175288244
xla_global_id: -1,
name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 4160159744
locality {
bus_id: 1
links {
}
}
incarnation: 14308843300195357737
physical_device_desc: "device: 0, name: GeForce RTX 2060, pci bus id: 0000:01:00.0, compute capability: 7.5"
xla_global_id: 416903419]
As you can see, the graphics card is being recognized. I ran a basic neural-network regression test based on some YouTube videos I've been watching. It uses insurance data and it's pretty small: only about 1,000 training samples and 11 features after transformation. It's all plain numeric data, no images or anything complicated. A very simple regression problem.
Here is the data download and initial transformation:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
# Read in the insurance data
insurance = pd.read_csv('https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv')
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
# Create a column transformer
ct = make_column_transformer(
    (MinMaxScaler(), ['age', 'bmi', 'children']),
    (OneHotEncoder(), ['sex', 'smoker', 'region'])
)
# Create X and y
X = insurance.drop("charges", axis=1)
y = insurance.charges
# Build train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the column transformer to the training data
ct.fit(X_train)
# Transform training and test data with normalization and one hot encoding
X_train_trans = ct.transform(X_train)
X_test_trans = ct.transform(X_test)
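Just to underline how small this is, here is a quick shape check on the transformed data (I believe the ColumnTransformer returns a dense NumPy array here, though it can also return a sparse matrix depending on the data):
# Sanity check: the transformed training data is tiny either way
print(type(X_train_trans))   # dense numpy array or scipy sparse matrix
print(X_train_trans.shape)   # roughly (1000, 11)
print(y_train.shape)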
And here is how I built my neural network. Again, I kept it very simple. I used the functional API here since I'm trying to learn it, but the speed ends up being the same with the Sequential API (I've sketched the Sequential version I compared against right after this block).
tf.random.set_seed(42)
inputs = tf.keras.Input(shape=X_train_trans[1].shape)
x = tf.keras.layers.Dense(128, activation='relu')(inputs)
x = tf.keras.layers.Dense(64, activation='relu')(x)
outputs = tf.keras.layers.Dense(1)(x)
ins_model_4 = tf.keras.Model(inputs, outputs)
ins_model_4.compile(loss='mae',
                    optimizer='adam',
                    metrics=['mae'])
history = ins_model_4.fit(X_train_trans, y_train, epochs=200, verbose=1)
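For reference, the Sequential version I compared against was essentially this (reconstructed from memory, so treat it as a sketch rather than the exact code I ran):
tf.random.set_seed(42)

# Sequential equivalent of the functional model above (same layers, same compile settings)
ins_model_seq = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
ins_model_seq.compile(loss='mae', optimizer='adam', metrics=['mae'])
history_seq = ins_model_seq.fit(X_train_trans, y_train, epochs=200, verbose=1)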
As you can see, it's a very shallow model, but for some reason it took about 30 seconds to train. That felt too long; it should be blazing fast. So I then re-ran the fit under tf.device with both the CPU and the GPU selected, like this:
# with the GPU selected
with tf.device('/gpu:0'):
    history = ins_model_4.fit(X_train_trans, y_train, epochs=200, verbose=1)

# with the CPU selected
with tf.device('/cpu:0'):
    history = ins_model_4.fit(X_train_trans, y_train, epochs=200, verbose=1)
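To put a rough number on the comparison, this is the kind of quick wall-clock check I mean (a sketch, not my exact code; if I understand it right, tf.debugging.set_log_device_placement should also show where each op actually runs):
import time

# Sketch: log op placement and time fit() on each device
# (placement logging is very verbose, so even a few epochs are enough to see it)
tf.debugging.set_log_device_placement(True)

for device in ('/gpu:0', '/cpu:0'):
    start = time.perf_counter()
    with tf.device(device):
        ins_model_4.fit(X_train_trans, y_train, epochs=200, verbose=0)
    print(device, "->", round(time.perf_counter() - start, 1), "seconds")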
And I found the results are the same. What is going on here? I have a few guesses.
Do I need to download the newer CUDA files? Is it possible for TF to recognize a GPU but not actually use it? Is there something about the data I'm using, or the regression problem I've defined, that is for some reason slow on my network? Did I code the tf.device part wrong? I could really use some help resolving this.