
I'm training a convolutional neural network on a set of ~9000 images (300x500) with a GTX 1080 Ti using TensorFlow 1.9, but I keep running out of memory. I get a warning that system memory has been exceeded by 10%, and after a few minutes the process gets killed. My code is below.

import tensorflow as tf
from os import listdir

train_path = '/media/NewVolume/colorizer/img/train/'  
col_train_path = '/media/NewVolume/colorizer/img/colored/train/'
val_path = '/media/NewVolume/colorizer/img/val/'
col_val_path = '/media/NewVolume/colorizer/img/colored/val/'

def load_image(image_file):
    image = tf.read_file(image_file)
    image = tf.image.decode_jpeg(image)
    return image

train_dataset = []
col_train_dataset = []
val_dataset = []
col_val_dataset = []

for i in listdir(train_path): 
    train_dataset.append(load_image(train_path + i))
    col_train_dataset.append(load_image(col_train_path + i))

for i in listdir(val_path): 
    val_dataset.append(load_image(val_path + i))
    col_val_dataset.append(load_image(col_val_path + i))

train_dataset = tf.stack(train_dataset)
col_train_dataset = tf.stack(col_train_dataset)
val_dataset = tf.stack(val_dataset)
col_val_dataset = tf.stack(col_val_dataset)

input1 = tf.placeholder(tf.float32, [None, 300, 500, 1])
color = tf.placeholder(tf.float32, [None, 300, 500, 3])

#MODEL

conv1 = tf.layers.conv2d(inputs = input1, filters = 8, kernel_size=[5, 5], activation=tf.nn.relu, padding = 'same')
pool1 = tf.layers.max_pooling2d(inputs = conv1, pool_size=[2, 2], strides=2)
conv2 = tf.layers.conv2d(inputs = pool1, filters = 16, kernel_size=[5, 5], activation=tf.nn.relu, padding = 'same')
pool2 = tf.layers.max_pooling2d(inputs = conv2, pool_size=[2, 2], strides=2)
conv3 = tf.layers.conv2d(inputs = pool2, filters = 32, kernel_size=[5, 5], activation=tf.nn.relu, padding = 'same')
pool3 = tf.layers.max_pooling2d(inputs = conv3, pool_size=[2, 2], strides=2)

flat = tf.layers.flatten(inputs = pool3)
dense = tf.layers.dense(flat, 2432, activation = tf.nn.relu)
reshaped = tf.reshape(dense, [tf.shape(dense)[0],38, 64, 1])

conv_trans1 = tf.layers.conv2d_transpose(inputs = reshaped, filters = 32, kernel_size=[5, 5], activation=tf.nn.relu, padding = 'same')
upsample1 = tf.image.resize_nearest_neighbor(conv_trans1, (2*tf.shape(conv_trans1)[1],2*tf.shape(conv_trans1)[2]))

conv_trans2 = tf.layers.conv2d_transpose(inputs = upsample1, filters = 16, kernel_size=[5, 5], activation=tf.nn.relu, padding = 'same')
upsample2 = tf.image.resize_nearest_neighbor(conv_trans2, (2*tf.shape(conv_trans2)[1],2*tf.shape(conv_trans2)[2]))
conv_trans3 = tf.layers.conv2d_transpose(inputs = upsample2, filters = 8, kernel_size=[5, 5], activation=tf.nn.relu, padding = 'same')
upsample3 = tf.image.resize_nearest_neighbor(conv_trans3, (2*tf.shape(conv_trans3)[1],2*tf.shape(conv_trans3)[2]))

conv_trans4 = tf.layers.conv2d_transpose(inputs = upsample3, filters = 3, kernel_size=[5, 5], activation=tf.nn.relu, padding = 'same')

reshaped2 = tf.reshape(dense, [tf.shape(conv_trans4)[0],300,500,3])

#TRAINING

loss = tf.losses.mean_squared_error(color, reshaped2)
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)

EPOCHS = 10
BATCH_SIZE = 3

dataset = tf.data.Dataset.from_tensor_slices((train_dataset,col_train_dataset)).repeat().batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(EPOCHS):
        x,y=iterator.get_next()
        _, loss_value = sess.run([train_step, loss],feed_dict={input1:x.eval(session=sess),color:y.eval(session=sess)})
        print("Iter: {}, Loss: {:.4f}".format(i, loss_value))
  • If you get the same error even with batch size 1, it means you are loading more images than your RAM can handle. Consider using TFRecords. – fractals Aug 20 '18 at 15:38
  • The code in question has input shape [300,500,1] and batch size of 3. The 1080 Ti card has some 12GB of RAM on it. I am pretty sure they are not loading too many images. But I do suspect that their network itself might be too large. – Mad Wombat Aug 20 '18 at 15:47
  • Could you post a summary of the model, like `model.summary()` in Keras? How many trainable parameters do you have? – null Aug 20 '18 at 16:05

1 Answer


I think your problem is in the following bit of code.

def load_image(image_file):
    image = tf.read_file(image_file)
    image = tf.image.decode_jpeg(image)
    return image
...

for i in listdir(train_path): 
    train_dataset.append(load_image(train_path + i))
    col_train_dataset.append(load_image(col_train_path + i))

You are using TF tensor operations as if they were regular Python code, but what you actually end up with are nodes on the graph that only get evaluated in a session. In this case, you build a graph that tries to load every image in both your training and your validation datasets into GPU memory (since your session runs on the GPU). I am guessing you have more images than your GPU has memory for.
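
For instance, calling the helper outside a session only adds ops to the default graph and returns a symbolic tensor, not pixel data (a quick check for illustration, not part of your original code; the exact op name may differ):

img_node = load_image(train_path + listdir(train_path)[0])
print(img_node)        # e.g. Tensor("DecodeJpeg:0", shape=(?, ?, ?), dtype=uint8)
print(type(img_node))  # tensorflow.python.framework.ops.Tensor, i.e. a graph node, no image bytes read yet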

There are multiple solutions to this problem. You can keep the tf.read_file/tf.image.decode_jpeg ops in your graph and feed the image file names for each batch through a feed dict in the training loop. You can build a proper input pipeline where listing file names, reading, decoding and batching are all handled in the graph. Or you can load the images into numpy arrays with an external library and feed those arrays into the graph.
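
A minimal sketch of the second option (an in-graph tf.data pipeline), assuming the grayscale and colored directories contain matching file names and all images are 300x500 JPEGs, as in your question:

import tensorflow as tf
from os import listdir

train_path = '/media/NewVolume/colorizer/img/train/'
col_train_path = '/media/NewVolume/colorizer/img/colored/train/'
BATCH_SIZE = 3

def parse_pair(gray_file, col_file):
    # Reading and decoding happen inside the graph, one batch at a time,
    # so only BATCH_SIZE images are in memory at once.
    gray = tf.image.decode_jpeg(tf.read_file(gray_file), channels=1)
    col = tf.image.decode_jpeg(tf.read_file(col_file), channels=3)
    gray = tf.image.convert_image_dtype(gray, tf.float32)
    col = tf.image.convert_image_dtype(col, tf.float32)
    gray.set_shape([300, 500, 1])   # assumes every image is 300x500
    col.set_shape([300, 500, 3])
    return gray, col

names = listdir(train_path)
gray_files = [train_path + n for n in names]
col_files = [col_train_path + n for n in names]

dataset = (tf.data.Dataset.from_tensor_slices((gray_files, col_files))
           .shuffle(len(names))
           .map(parse_pair)
           .repeat()
           .batch(BATCH_SIZE))

input1, color = dataset.make_one_shot_iterator().get_next()
# Build the model directly on input1/color instead of placeholders,
# then just call sess.run([train_step, loss]) in the training loop.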

Mad Wombat