
I am having some difficulty understanding exactly why GPU and CPU speeds are similar for small networks (the CPU is sometimes faster), while the GPU is faster for larger networks. The code at the bottom of the question runs in 103.7 seconds on an i7-6700K, but in 29.5 seconds when using tensorflow-gpu.

However, when I train a network that has 100 hidden neurons instead of the 1000 in the example below, I get ~20 seconds when using the GPU and ~15 seconds when using the CPU.

I read in another Stack Overflow answer that CPU->GPU transfers take a long time; I'm assuming this refers to loading the data examples onto the GPU.

Can someone explain why this occurs, and possibly reference some change in the code that I can make to maximize speed?

import numpy as np
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers.core import Dense, Activation, Flatten, Dropout
from sklearn.preprocessing import normalize

## Importing the MNIST dataset using Keras
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# reshape for vector input
N, x, y = X_train.shape
X_train = normalize(np.reshape(X_train, (N, x * y)))

N, x, y = X_test.shape
X_test = normalize(np.reshape(X_test, (N, x * y)))

# one-hot encoding
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)

model = Sequential()
model.add(Dense(output_dim=750, input_dim=784))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(150))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(10))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer='Nadam', metrics=['accuracy'])

fit = model.fit(X_train, y_train, batch_size=128, nb_epoch=10, verbose=0)

## Printing the accuracy of our model, according to the loss function specified in model.compile above
score = model.evaluate(X_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])
Enrico Borba
  • What GPU are you using? Note that completely saturating a top-of-the-line GPU requires tens of thousands of threads. Assuming each thread handles the computation of one neuron, a system with 100 neurons would be underutilizing the GPU. Conversely, if you were to increase the number of neurons to, say, 10K, the relative advantage of the GPU over the CPU is likely to increase further (see the rough timing sketch after these comments). – njuffa Feb 07 '17 at 18:41
  • Whoops, totally forgot to include that in the answer. I have a GTX 1070. And I see. That makes sense – Enrico Borba Feb 07 '17 at 18:42
  • I actually noticed the same behaviour on my GTX 1070 GPU. I don't see any difference between running my model (which has similar dimensions to the one you are using) on CPU (i7-7700) and the GPU. Need to try to increase the capacity of the network to evaluate the difference – mspadaccino Oct 04 '17 at 07:36
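To make the saturation point from the comments concrete, here is a rough timing sketch, assuming TensorFlow 1.x and a machine with at least one visible GPU; it times a dense-layer-sized matrix multiply on the CPU and on the GPU for a few layer widths, and the exact numbers will vary with hardware:

import time
import numpy as np
import tensorflow as tf

def time_matmul(device, units, batch=128, reps=100):
    # Time `reps` runs of a (batch, 784) x (784, units) matmul on the given device.
    tf.reset_default_graph()
    with tf.device(device):
        x = tf.constant(np.random.rand(batch, 784).astype(np.float32))
        w = tf.constant(np.random.rand(784, units).astype(np.float32))
        y = tf.matmul(x, w)
    with tf.Session() as sess:
        sess.run(y)  # warm-up: graph setup, memory allocation, constant upload
        start = time.time()
        for _ in range(reps):
            sess.run(y)  # includes fetching the result back to the host
        return time.time() - start

for units in (100, 1000, 10000):
    print('units=%5d  cpu=%.3fs  gpu=%.3fs'
          % (units, time_matmul('/cpu:0', units), time_matmul('/gpu:0', units)))

For narrow layers the per-step launch and transfer overhead dominates, so the GPU shows little or no advantage; as the layer widens, the GPU pulls ahead.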

1 Answer


In the case of tiny networks, batch loading may be the culprit here.

Keras loads each minibatch from RAM to the GPU at the start of each iteration, which creates a bottleneck for tiny networks (where the forward/backward computation is very quick).
You can try using model.fit_generator instead of plain fit, so that the CPU thread that loads minibatches works in parallel, as in the sketch below.
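A minimal sketch of that approach, assuming the Keras 2 fit_generator signature (steps_per_epoch/epochs; Keras 1 used samples_per_epoch/nb_epoch) and reusing model, X_train and y_train from the question:

import numpy as np

def batch_generator(X, y, batch_size=128):
    # Yield (inputs, targets) minibatches forever, reshuffling on every pass.
    n = X.shape[0]
    while True:
        idx = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            yield X[batch], y[batch]

steps = int(np.ceil(X_train.shape[0] / 128.0))
model.fit_generator(batch_generator(X_train, y_train, batch_size=128),
                    steps_per_epoch=steps,
                    epochs=10,
                    workers=1,   # a single background thread prepares batches
                    verbose=0)

With workers=1 (the default) a single background thread pulls batches from the generator, so the plain Python generator above does not need to be thread-safe.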

Unfortunately, there is no way I am aware of to preload the whole dataset onto the GPU for Keras (see my issue).

If you're using the TensorFlow backend, you can use the Google Timeline profiling tool to see what causes the slowdowns. For reference, see this issue.
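A minimal sketch of the underlying TensorFlow 1.x tracing mechanism (tf.RunOptions, tf.RunMetadata and the timeline module), shown here on a bare session rather than wired through Keras:

import tensorflow as tf
from tensorflow.python.client import timeline

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

x = tf.random_normal([128, 784])
w = tf.random_normal([784, 750])
y = tf.matmul(x, w)

with tf.Session() as sess:
    # Collect per-op timing and device placement for this single run
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace that can be opened at chrome://tracing
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(trace.generate_chrome_trace_format())

The resulting timeline.json shows where each op ran and how long it took, which makes host-to-device memcpy overhead easy to spot.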

  • 1
    Thanks, batch loading was the issue for me. Now it runs much faster. – Jonas Sourlier May 15 '18 at 12:12
  • Can you explain how to write a good generator as you described? – volperossa May 19 '18 at 18:42
  • 1
    Not sure that I understand what do you mean by good, there are a couple of examples searchable with google like this: https://www.kaggle.com/ezietsman/simple-keras-model-with-data-generator – Alexander Serikov May 20 '18 at 19:14
  • 1
    Worth mentioning, slow performance on GPU sometimes can be solved by using a cudnn layer, see this question: https://stackoverflow.com/questions/41948406/why-is-my-gpu-slower-than-cpu-when-training-lstm-rnn-models – Guy s Aug 05 '19 at 12:52