
I have a custom file containing the paths to all my images and their labels, which I load into a dataframe with:

MyIndex=pd.read_table('./MySet.txt')

MyIndex has two columns of interest, ImageName and ClassName.
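For reference, MySet.txt is a tab-separated file along these lines (paths abbreviated; see my comments below for the real layout):

ImageName	ClassName
Dodge_Charger/1985/ref1.jpg	Dodge_Charger
Dodge_Charger/1985/ref2.jpg	Dodge_Charger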

Next I load the images, split into train/validation/test sets, and one-hot encode the output labels:

images=[]
for index, row in MyIndex.iterrows():
    img_path=basePath+row['ImageName']
    img = image.load_img(img_path, target_size=(299, 299))
    img_data = image.img_to_array(img)
    images.append(img_data)
    # Drop intermediate references so memory can be reclaimed sooner
    img_path = img = img_data = None


images[0].shape

Classes=MyIndex['ClassName']
OutputClasses=Classes.unique().tolist()

labels=MyIndex['ClassName']
images=np.array(images, dtype="float") / 255.0
(trainX, testX, trainY, testY) = train_test_split(images,labels, test_size=0.10, random_state=42)
trainX, valX, trainY, valY = train_test_split(trainX, trainY, test_size=0.10, random_state=41)

images=None
labels=None

encoder = LabelEncoder()
encoder=encoder.fit(OutputClasses)
encoded_Y = encoder.transform(trainY)
# convert integers to dummy variables (i.e. one hot encoded)
trainY = to_categorical(encoded_Y, num_classes=len(OutputClasses))

encoded_Y = encoder.transform(valY)
# convert integers to dummy variables (i.e. one hot encoded)
valY = to_categorical(encoded_Y, num_classes=len(OutputClasses))

encoded_Y = encoder.transform(testY)
# convert integers to dummy variables (i.e. one hot encoded)
testY = to_categorical(encoded_Y, num_classes=len(OutputClasses))

datagen=ImageDataGenerator(rotation_range=90,
                           horizontal_flip=True,
                           vertical_flip=True,
                           width_shift_range=0.25,
                           height_shift_range=0.25)
# fit() is only required when featurewise statistics are requested
# (featurewise_center, featurewise_std_normalization, zca_whitening);
# with these settings it computes nothing, but it is harmless
datagen.fit(trainX,augment=True)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


batch_size=128
model.fit_generator(datagen.flow(trainX,trainY,batch_size=batch_size), epochs=500, 
                    steps_per_epoch=trainX.shape[0]//batch_size,validation_data=(valX,valY))

The problem I face is that the data loaded in one go is too large to fit in the machine's memory, so I am unable to work with the complete dataset.

I have tried to work with ImageDataGenerator, but I do not want to follow the directory conventions it expects, and I also cannot drop the augmentation part.

The question: is there a way to load batches from disk while satisfying the two conditions above?

3 Answers


If you want to load from disk, it is convenient to do so with the ImageDataGenerator you already use.

There are two ways to do it: by pointing flow_from_directory at the directory of the data, or by using flow_from_dataframe with a Pandas dataframe. A minimal sketch of the first option follows.
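This is a sketch only; train_dir is a hypothetical directory arranged with one sub-folder per class, which is the layout flow_from_directory expects:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = datagen.flow_from_directory(
    'train_dir',               # hypothetical directory, one sub-folder per class
    target_size=(299, 299),
    batch_size=128,
    class_mode='categorical')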

If you instead have a list of paths, you can use a custom generator that yields batches of images. Here is a stub:

def load_image_from_path(path):
    """Load one image from disk and preprocess it."""
    ...

def my_generator(df, batch_size):
    length = df.shape[0]
    while True:  # Keras expects the generator to yield indefinitely
        for i in range(0, length, batch_size):
            batch = df.iloc[i:i + batch_size]
            x = [load_image_from_path(p) for p in batch['ImageName']]
            y = batch['ClassName']
            yield x, y

Note: fit_generator also takes a validation_data argument, which, as you guessed, is for validation, and it can be a generator as well. One option is to pass each generator the indices to draw from in order to split train and validation (assuming the data is shuffled; if not, shuffle it first).
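As a sketch of that wiring (assuming the my_generator stub above, and that df has the same columns as your MyIndex):

import numpy as np
from sklearn.model_selection import train_test_split

# Split the row indices and give each generator its own slice of the frame
train_idx, val_idx = train_test_split(np.arange(df.shape[0]), test_size=0.1,
                                      random_state=42)

model.fit_generator(my_generator(df.iloc[train_idx], batch_size),
                    steps_per_epoch=len(train_idx) // batch_size,
                    validation_data=my_generator(df.iloc[val_idx], batch_size),
                    validation_steps=len(val_idx) // batch_size,
                    epochs=10)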

– Nadav

I believe you should have a look at this post

What you are looking for is Keras' flow_from_dataframe, which lets you load batches from disk by providing the names of your files and their labels in a dataframe, along with a top-level directory path that contains all your images.

With a few modifications to your code, and borrowing some from the link shared:

MyIndex=pd.read_table('./MySet.txt')

Classes=MyIndex['ClassName']
OutputClasses=Classes.unique().tolist()

trainDf=MyIndex[['ImageName','ClassName']]
train, test = train_test_split(trainDf, test_size=0.10, random_state=1)
batch_size=128  # must be defined before the generators below use it


#creating a data generator to load the files on runtime
traindatagen=ImageDataGenerator(rotation_range=90,
                                horizontal_flip=True,
                                vertical_flip=True,
                                width_shift_range=0.25,
                                height_shift_range=0.25,
                                validation_split=0.1)
train_generator=traindatagen.flow_from_dataframe(
    dataframe=train,
    directory=basePath,#the directory containing all your images
    x_col='ImageName',
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=batch_size,
    subset='training'
)
#Also a generator for the validation data
val_generator=traindatagen.flow_from_dataframe(
    dataframe=train,
    directory=basePath,#the directory containing all your images
    x_col='ImageName',
    y_col='ClassName',
    class_mode='categorical',
    target_size=(299, 299),
    batch_size=batch_size,
    subset='validation'
)


STEP_SIZE_TRAIN=train_generator.n//train_generator.batch_size
STEP_SIZE_VALID=val_generator.n//val_generator.batch_size
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit_generator(generator=train_generator, steps_per_epoch=STEP_SIZE_TRAIN,
                    validation_data=val_generator,
                    validation_steps=STEP_SIZE_VALID,
                    epochs=500)

Also note that you no longer need the label encoding from your original code, and you can omit the image-loading loop as well.
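If you later need the mapping from class names to one-hot indices (for example, to decode predictions), the generator exposes it; a small sketch:

# flow_from_dataframe builds the label mapping itself; no LabelEncoder needed
print(train_generator.class_indices)   # e.g. {'Dodge_Charger': 0, ...}

# Invert it to turn a predicted index back into a class name
index_to_class = {v: k for k, v in train_generator.class_indices.items()}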

I have not tried this code myself, so be prepared to fix any bugs you encounter; the primary focus is to convey the basic idea.

In response to your comment: if your files live in different directories, one solution is to have ImageName store the relative path, including the intermediate directories, e.g. './Dir/File.jpg'. Then move all the directories under one folder, use that folder as the base path, and everything else stays the same. Looking at the code segment that loaded your files, it seems you already store file paths in the ImageName column, so the suggested approach should work for you:

images=[]
for index, row in MyIndex.iterrows():
    img_path=basePath+row['ImageName']
    img = image.load_img(img_path, target_size=(299, 299))
    img_data = image.img_to_array(img)
    images.append(img_data)
    # Drop intermediate references so memory can be reclaimed sooner
    img_path = img = img_data = None
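As a quick sanity check (a sketch, assuming basePath and MyIndex as in your question), you can verify that the relative paths resolve before training:

import os

missing = [p for p in MyIndex['ImageName']
           if not os.path.isfile(os.path.join(basePath, p))]
print(len(missing), 'missing files')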

If any ambiguity remains, feel free to ask again.

– Tayyab
  • The post suggests that all files must be in the same directory, but I have files in separate folders, so I believe this might not work for me. – pure_virtual Jun 28 '19 at 11:38
  • @pure_virtual please see the new edit in response of your comment. – Tayyab Jun 28 '19 at 11:54
  • Yes, I have ImageName containing the paths, like basePath/Dodge_Charger/1985/ref1.jpg for the Dodge_Charger class. – pure_virtual Jun 28 '19 at 11:56
  • Yes, that is exactly my point: if the paths are in the described format, the code should work fine for you. – Tayyab Jun 28 '19 at 11:58
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195688/discussion-between-pure-virtual-and-tayyab). – pure_virtual Jun 28 '19 at 12:01

I think the simplest way to do this would be to load only part of your images for each generator and repeatedly call .fit_generator() with those smaller batches.

An earlier version of this answer used `random.random()` to choose which images to load; you could still use something more sophisticated. This revised version instead uses a start index and a page size to loop over the list of images forever.

import itertools


def load_images(start_index, page_size):
    images = []
    for index in range(page_size):
        # Generate index using modulo to loop over the list forever
        index = (start_index + index) % len(MyIndex)
        row = MyIndex.iloc[index]
        img_path = basePath + row["ImageName"]
        img = image.load_img(img_path, target_size=(299, 299))
        img_data = image.img_to_array(img)
        images.append(img_data)
    return images


def generate_datagen(batch_size, start_index, page_size):
    images = load_images(start_index, page_size)

    # ... everything else you need to get from images to trainX and trainY, etc. here ...

    datagen = ImageDataGenerator(
        rotation_range=90,
        horizontal_flip=True,
        vertical_flip=True,
        width_shift_range=0.25,
        height_shift_range=0.25,
    )
    datagen.fit(trainX, augment=True)
    return (
        trainX,
        trainY,
        valX,
        valY,
        datagen.flow(trainX, trainY, batch_size=batch_size),
    )


model.compile(
    loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"]
)

page_size = 500  # load 500 images at a time; change as suitable for your memory

for page in itertools.count():  # Count from zero to forever.
    batch_size = 128
    trainX, trainY, valX, valY, generator = generate_datagen(
        batch_size, page * page_size, page_size
    )
    model.fit_generator(
        generator,
        epochs=5,
        steps_per_epoch=trainX.shape[0] // batch_size,
        validation_data=(valX, valY),
    )
    # TODO: add a `break` clause with a suitable condition
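    # For instance (an assumption, not part of the original answer), stop
    # once every image has been visited at least once:
    if (page + 1) * page_size >= len(MyIndex):
        break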

– AKX
  • I do not get it: what is the purpose of the cut-off? – pure_virtual Jun 28 '19 at 11:39
  • It's there so you can choose a small enough batch size. For instance, if you estimate you can fit 20% of images in memory, set it to 0.2 and a random 20% is chosen every time. – AKX Jun 28 '19 at 11:51
  • I think that would not be very useful if I want to train on all images, right? – pure_virtual Jun 28 '19 at 11:53
  • It'd eventually likely train on all images. Anyway, sure, I can make things a little more stateful so it'll use all images. – AKX Jun 28 '19 at 11:54
  • Edited – now it paginates over the MyIndex list. – AKX Jun 28 '19 at 12:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195689/discussion-between-pure-virtual-and-akx). – pure_virtual Jun 28 '19 at 12:02