Keras: ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[26671,32,32,64]

Question

I am training my network using Keras on the TensorFlow backend (Keras version 2.1). I have tried many of the suggestions available on the internet, but have not found a solution.

My training set and labels: 26721 images (each of size (32, 32, 1)), labels of shape (26721, 10)
Validation set and labels: 6680 images (each of size (32, 32, 1)), labels of shape (6680, 10)

This is my model so far. I am using Python 3.

def CNN(input_, num_classes):

    model = Sequential()

    model.add(Convolution2D(16, kernel_size=(7, 7), border_mode='same',
                            input_shape=input_))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(2, 2), strides=(1, 1), border_mode='same'))
    model.add(Convolution2D(64, (3, 3), padding='same'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(MaxPooling2D(pool_size=(3, 3), strides=(1, 1), border_mode='same'))
    model.add(Flatten())
    model.add(Dense(96))
    model.add(Activation('relu'))

    model.add(Dense(num_classes))
    model.add(Activation('softmax'))
    return model

model = CNN(image_size, num_classes)

model.compile(loss=keras.losses.categorical_crossentropy,
          optimizer=keras.optimizers.SGD(lr=0.01),
          metrics=['accuracy'])

print(model.summary())
csv_logger = CSVLogger('training.log')
early_stop = EarlyStopping('val_acc', patience=200, verbose=1)
model_checkpoint = ModelCheckpoint(model_save_path,
                                    'val_acc', verbose=0,
                                    save_best_only=True)

model_callbacks = [early_stop, model_checkpoint, csv_logger]
# print "len(train_dataset) ", len(train_dataset)
print("int(len(train_dataset)/batch_size) ", int(len(train_dataset)/batch_size))
K.get_session().run(tf.global_variables_initializer())
model.fit_generator(train,
                    steps_per_epoch=np.ceil(len(train_dataset)/batch_size),
                    epochs=num_epochs,
                    verbose=1,
                    validation_data=valid,
                    validation_steps=batch_size,
                    callbacks=model_callbacks)

Model Summary:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d_1 (Conv2D)            (None, 32, 32, 16)        800       
_________________________________________________________________
batch_normalization_1 (Batch (None, 32, 32, 16)        64        
_________________________________________________________________
activation_1 (Activation)    (None, 32, 32, 16)        0         
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 16)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 32, 32, 64)        9280      
_________________________________________________________________
batch_normalization_2 (Batch (None, 32, 32, 64)        256       
_________________________________________________________________
activation_2 (Activation)    (None, 32, 32, 64)        0         
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 32, 32, 64)        0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 65536)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 96)                6291552   
_________________________________________________________________
activation_3 (Activation)    (None, 96)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 10)                970       
_________________________________________________________________
activation_4 (Activation)    (None, 10)                0         
=================================================================
Total params: 6,302,922
Trainable params: 6,302,762
Non-trainable params: 160

I am sending images according to batch size. This is my generator function:

# Generate images according to batch size
def gen(dataset, labels, batch_size):
    images = []
    digits = []
    i = 0
    while True:
        images.append(dataset[i])
        digits.append(labels[i])
        i += 1
        if i == batch_size:
            yield (np.array(images), np.array(digits))
            images = []
            digits = []
        # Generate remaining images also
        if i == len(dataset):
            yield (np.array(images), np.array(digits))
            images, digits = [], []
            i = 0

train = gen(train_data, train_labels, batch_size)
valid = gen(valid_data, valid_lables, batch_size)

Error log on terminal:

Please check this link for complete error: Terminal Output

Can anyone please help me? What am I doing wrong here?

Thanks in advance

score 3 · Accepted Answer · answered Dec 15 '17 at 19:04


You are training your network on your entire training set at once, which is too large to fit in memory and too large for your GPU.

The standard practice in machine learning is to split the data into small batches and train on those. Batch sizes are usually 16, 32, 64, or some other power of two, but they can be anything; you usually have to find a suitable batch size by experiment, for example through cross-validation.
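A back-of-the-envelope estimate shows why the batch size matters here. The sketch below computes the size of the tensor named in the error message, assuming 4-byte float32 elements (the post does not state the dtype), and compares it with the same activation for a batch of 64. `tensor_bytes` is a helper name made up for this illustration.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor; 4 bytes per element assumes float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape taken from the error message: [26671, 32, 32, 64]
full = tensor_bytes((26671, 32, 32, 64))
# The same activation for a batch of 64 images
small = tensor_bytes((64, 32, 32, 64))

print(f"{full / 2**30:.2f} GiB")   # 6.51 GiB
print(f"{small / 2**20:.0f} MiB")  # 16 MiB
```

That is the size of a single activation tensor; training also has to hold the other layers' activations and their gradients, so the real requirement is higher.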


Hasnain Raza


I am sending batches of 100 images; I have edited the gen function in the post. I am using a generator to feed images to the network. – Lucky Dec 15 '17 at 21:30

Your generator function is faulty, because somewhere it yields a batch of 26671 images. Print the size of the batch on every iteration to see where it goes wrong. – Hasnain Raza Dec 15 '17 at 21:35
I have exactly the same configuration that worked for me before: batch size = 8, and now, it does not work. – SalahAdDin Mar 03 '20 at 12:14
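Following the suggestion above to print the batch sizes, here is a small sketch of what the question's gen does and one possible fix. In the original, `i == batch_size` is true only once per pass because `i` is not reset after the first batch, so every remaining sample piles up until `i == len(dataset)`. Note that 26671 = 26721 − 50, which would be consistent with a batch size of 50 when the error was raised; that is an inference, not something the post states. Plain lists stand in for the NumPy arrays, and `gen_original`, `gen_fixed` and `batch_sizes` are names made up for this example.

```python
from itertools import islice


def gen_original(dataset, labels, batch_size):
    """The question's logic, with plain lists standing in for np.array."""
    images, digits = [], []
    i = 0
    while True:
        images.append(dataset[i])
        digits.append(labels[i])
        i += 1
        if i == batch_size:  # true only once per pass: i is never reset here
            yield images, digits
            images, digits = [], []
        if i == len(dataset):  # holds everything since the first batch
            yield images, digits
            images, digits = [], []
            i = 0


def gen_fixed(dataset, labels, batch_size):
    """Yield batches forever; no batch is ever larger than batch_size."""
    n = len(dataset)
    while True:  # fit_generator expects the generator to loop indefinitely
        for start in range(0, n, batch_size):
            stop = start + batch_size
            # Slicing past the end is safe: the last batch is just shorter.
            yield dataset[start:stop], labels[start:stop]


def batch_sizes(generator, count):
    """Sizes of the first `count` batches produced by `generator`."""
    return [len(images) for images, _ in islice(generator, count)]


data = list(range(10))
labels = list(range(10))

print(batch_sizes(gen_original(data, labels, 4), 4))  # [4, 6, 4, 6]
print(batch_sizes(gen_fixed(data, labels, 4), 6))     # [4, 4, 2, 4, 4, 2]
```

If the dataset and labels are NumPy arrays, the slices in `gen_fixed` are already arrays, so no extra conversion should be needed before handing the generator to fit_generator.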

score 0 · Answer 2 · answered Dec 17 '17 at 19:31

From the logs you can see that the memory is already full before edge_1094_loss is allocated. Check the values of Limit and InUse.

This is perhaps because the memory is still held by older models. A quick hack to solve this is to simply kill the process, which releases all the memory consumed by older models that were somehow not garbage collected.
