How to work with big dataset for multi-label image classification in terms of memory and batches

Question

I am working on a dataset of 300K images for multi-label image classification. So far I have only used a small subset of around 7k images, but the code either raises a MemoryError or my notebook kernel dies. The code below converts all images to a single NumPy array at once, which exhausts memory when the last line runs. train.csv contains the image filenames and one-hot-encoded labels. The code looks like this:

data = pd.read_csv('train.csv')

img_width = 400
img_height = 400

img_vectors = []

for i in range(data.shape[0]):
    path = 'Images/' + data['Id'][
    img = image.load_img(path, target_size=(img_width, img_height, 3))
    img = image.img_to_array(img)
    img = img/255.0
    img_vectors.append(img)

img_vectors = np.array(img_vectors)

Error Message:

MemoryError                               Traceback (most recent call last)
<ipython-input-13-dd2302ae54e1> in <module>
----> 1 img_vectors = np.array(img_vectors)

MemoryError: Unable to allocate array with shape (7344, 400, 400, 3) and data type float32

I guess I need to process the images in smaller batches of arrays to avoid the memory issue, rather than holding all the image data in one array at the same time.
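As a rough check on the error above, the size of the failing array can be worked out from the shape in the message. This is only a back-of-the-envelope sketch: it assumes 4 bytes per float32 value, and the batch size of 32 is an illustrative choice rather than something taken from the original code.

```python
# Shape reported in the MemoryError: (7344, 400, 400, 3), dtype float32.
n_images, height, width, channels = 7344, 400, 400, 3
bytes_per_value = 4  # float32

bytes_per_image = height * width * channels * bytes_per_value
full_bytes = n_images * bytes_per_image      # whole 7k subset in one array
all_bytes = 300_000 * bytes_per_image        # the full 300K dataset
batch_bytes = 32 * bytes_per_image           # one batch (32 is an assumed size)

print(f"7k subset: {full_bytes / 1e9:.1f} GB")   # 14.1 GB
print(f"300K set:  {all_bytes / 1e9:.1f} GB")    # 576.0 GB
print(f"one batch: {batch_bytes / 1e6:.1f} MB")  # 61.4 MB
```

Under these assumptions a single batch needs tens of megabytes, while the whole subset needs about 14 GB, which is consistent with the allocation failing.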

In an earlier project I did single-label image classification with around 225k images. That code doesn't convert all the image data to one giant array; instead it feeds the image data to the model in smaller batches:

#image preparation
if K.image_data_format() == "channels_first":
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

train_datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(train_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')
validation_generator = test_datagen.flow_from_directory(validation_data_dir, target_size=(img_width, img_height), batch_size=batch_size, class_mode='categorical')

model = Sequential()
model.add(Conv2D(32, (3,3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
...
model.add(Dense(17))
model.add(BatchNormalization(axis=1, momentum=0.6))
model.add(Activation('softmax'))

model.summary()    

model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size,
    class_weight = class_weight
)

So what I actually need is an approach for handling big image datasets for multi-label classification without running into memory trouble. Ideally it would work with a CSV file containing image filenames and one-hot-encoded labels, combined with batches of arrays for training.

Any help or guesses here would be greatly appreciated.

I don't quite get your question. You have provided the solution yourself in the second part: you need to use the image generator — MaximeKan, Jan 19 '20 at 11:50
The lower code does image classification without multi-label. It also doesn't use CSV data with one-hot-encoded labels. And I don't know how to implement an image generator in the first code, if that's the solution. — sebk, Jan 19 '20 at 12:14

score 1 · Answer 1 · answered Jan 19 '20 at 17:18

The easiest way to solve the problem you are facing is to write a custom data generator; here is a tutorial that shows how to do this. The idea is that instead of using flow_from_directory, you create a custom data loader that reads each image from its source path and assigns the corresponding labels to y. In practice, I assume your data is stored in a .csv file where each row contains the path to an image and the labels present in that image. Your generator will then have a function __getitem__(self, index) that reads the image from the path in row number index and returns it along with the target. The target is obtained by reading the labels in that row, one-hot encoding each of them, and summing the resulting vectors, which gives a single multi-hot vector per image.
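A minimal sketch of such a generator might look like the following. It is not the code from the tutorial: the class name CsvBatchGenerator, the loader parameter, and the default batch size are hypothetical names chosen for illustration, and the default loader assumes the TensorFlow 2.x import path for the Keras image utilities. Labels are assumed to already be one-hot (multi-hot) rows, as in train.csv.

```python
import math
import numpy as np


def load_image_keras(path, target_size):
    """Default loader (assumed TF 2.x import path); imported lazily on first use."""
    from tensorflow.keras.preprocessing import image
    img = image.load_img(path, target_size=target_size)
    return image.img_to_array(img) / 255.0


class CsvBatchGenerator:
    """Hypothetical batch generator: loads images only when a batch is requested."""

    def __init__(self, paths, labels, batch_size=32, target_size=(400, 400),
                 loader=load_image_keras):
        self.paths = list(paths)                          # one file path per row
        self.labels = np.asarray(labels, dtype="float32")  # one multi-hot row per image
        self.batch_size = batch_size
        self.target_size = target_size
        self.loader = loader

    def __len__(self):
        # Number of batches per epoch; the last batch may be smaller.
        return math.ceil(len(self.paths) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        stop = start + self.batch_size
        # Only the images of this batch are read from disk and held in memory.
        x = np.stack([self.loader(p, self.target_size)
                      for p in self.paths[start:stop]])
        return x.astype("float32"), self.labels[start:stop]
```

To feed this to model.fit, the class would typically subclass tensorflow.keras.utils.Sequence, which uses the same __len__ and __getitem__ protocol. The paths and labels could come from pd.read_csv('train.csv'), for example by taking the label columns as a NumPy array, assuming every column other than Id is a label. For multi-label targets the final layer is usually a sigmoid trained with binary_crossentropy rather than the softmax and categorical_crossentropy used in the single-label model.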

Dear omar/hola, thanks for your reply. Even though I haven't managed to get my code working yet, this was a very helpful hint. — sebk, Feb 19 '20 at 17:14
