I'm having difficulty understanding exactly why GPU and CPU training speeds are similar for small networks (the CPU is sometimes faster), while the GPU is clearly faster for larger networks. The code at the bottom of the question runs in 103.7 s on an i7-6700K with CPU-only TensorFlow, but in 29.5 s with tensorflow-gpu.
However, when I train a network with 100 hidden neurons in total, instead of the 1000 (750 + 150 + 50 + 50) in the example below, I get ~20 seconds with the GPU and ~15 seconds with the CPU.
I read in another Stack Overflow answer that CPU->GPU transfers are slow; I'm assuming this refers to loading the training batches onto the GPU.
Can someone explain why this happens, and, if possible, point out a change to the code that would maximize speed?
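As a sanity check (TF 1.x-style API, matching the tensorflow-gpu install above), the snippet below is the standard way to confirm that TensorFlow can see the GPU at all and to log where each op is placed:

import tensorflow as tf
from tensorflow.python.client import device_lib

# List the devices TensorFlow can see; a GPU entry here means
# tensorflow-gpu found the card and its CUDA/cuDNN libraries.
print(device_lib.list_local_devices())

# log_device_placement prints the device each op is assigned to,
# so the matmuls from the Dense layers should appear on /gpu:0.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

The full script I'm benchmarking: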
import numpy as np
import tensorflow as tf
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers.core import Dense, Activation, Dropout
from sklearn.preprocessing import normalize
## Importing the MNIST dataset using Keras
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# flatten each 28x28 image into a 784-vector; sklearn's normalize()
# then scales each row to unit L2 norm
N, x, y = X_train.shape
X_train = normalize(np.reshape(X_train, (N, x * y)))
N, x, y = X_test.shape
X_test = normalize(np.reshape(X_test, (N, x * y)))
# one-hot encoding
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
# MLP: 784 -> 750 -> 150 -> 50 -> 50 -> 10, ReLU activations, 20% dropout
model = Sequential()
model.add(Dense(output_dim=750, input_dim=784))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(150))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(50))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='Nadam', metrics=['accuracy'])
fit = model.fit(X_train, y_train, batch_size=128, nb_epoch=10, verbose=0)
## Evaluate on the test set: model.evaluate returns the loss followed by
## the metrics specified in model.compile above
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
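For reference, the timings above are plain wall-clock measurements around model.fit, along the lines of the loop below (the exact harness is illustrative). It also sweeps batch_size, since larger batches should mean fewer, larger CPU->GPU transfers per epoch; the sizes are guesses to try, not tested recommendations:

import time

# Time fit() at several batch sizes; only 2 epochs per size to keep
# the sweep quick. The batch sizes are untested guesses.
for batch in (128, 512, 2048):
    start = time.time()
    model.fit(X_train, y_train, batch_size=batch, nb_epoch=2, verbose=0)
    print('batch_size=%d: %.1f s' % (batch, time.time() - start))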