
I have a very large dataset that does not fit into memory, so I split it into several files and want to read them from a data generator during training. I use the following code:

def csv_image_generator(i,inputPath1, bs, lb, mode="train", aug=None):

    # open the CSV file for reading
    # loop indefinitely
    while True:
        f = open('mnist_1D_train_'+str(i)+'.csv', "r")

        # initialize our batches of images and labels
        print(i)
        print('mnist_1D_train_'+str(i)+'.csv')
        images = []
        labels = []
        # keep looping until we reach our batch size
        while len(images) < bs:
            # attempt to read the next line of the CSV file
            line = f.readline()
            # check to see if the line is empty, indicating we have
            # reached the end of the file
            if line == "":
                # reset the file pointer to the beginning of the file
                # and re-read the line
                f.seek(0)
                line = f.readline()

                # if we are evaluating we should now break from our
                # loop to ensure we don't continue to fill up the
                # batch from samples at the beginning of the file
#               if mode == "eval":
#                   break
            # extract the label and construct the image
            line = line.strip().split(",")
            label = line[0]
            image = np.array([float(x) for x in line[1:]], dtype="float")
            image = image.reshape((1, 28, 28))
            image = image.T

            # update our corresponding batches lists
            images.append(image)
            labels.append(label)
        # one-hot encode the labels
        labels = lb.transform(np.array(labels))
        # if the data augmentation object is not None, apply it
        if aug is not None:
            (images, labels) = next(aug.flow(np.array(images),
                labels, batch_size=bs))
        # yield the batch to the calling function
        yield (np.array(images), labels)
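
For context, the pattern I am aiming for is a single generator that walks through every split file in turn. A minimal sketch of that idea (the file-name pattern, the number of files, and the already-fitted LabelBinarizer `lb` are assumptions; trailing partial batches are simply dropped):

import numpy as np

def multi_file_generator(num_files, bs, lb):
    # loop indefinitely, cycling over every split CSV file
    while True:
        for i in range(num_files):
            with open('mnist_1D_train_' + str(i) + '.csv', 'r') as f:
                images, labels = [], []
                for line in f:
                    parts = line.strip().split(',')
                    labels.append(parts[0])
                    image = np.array([float(x) for x in parts[1:]], dtype='float')
                    images.append(image.reshape((28, 28, 1)))
                    if len(images) == bs:
                        yield np.array(images), lb.transform(np.array(labels))
                        images, labels = [], []
                # any leftover samples smaller than bs are dropped here for simplicity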

Create the label binarizer for one-hot encoding the labels, encode the testing labels, construct the training image generator for data augmentation, and initialize both the training and testing generators:

lb = LabelBinarizer()
lb.fit(list(labels))
testLabels = lb.transform(testLabels)

trainGen = csv_image_generator(TRAIN_CSV, BS, lb, mode="train", aug=aug)
testGen = csv_image_generator_test(TEST_CSV, BS, lb, mode="train", aug=None)

and then I use

H = model.fit_generator(
    trainGen,
    steps_per_epoch=NUM_TRAIN_IMAGES // (BS*2),
    validation_data=testGen,
    validation_steps=NUM_TEST_IMAGES // (BS*2),
    epochs=NUM_EPOCHS)

but `fit_generator` only ever reads from the first file.
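
One way to double-check which file the batches come from, independently of `fit_generator`, is to pull a few batches by hand (a minimal check reusing the generator and constants above; the split index 0 is passed explicitly here for illustration):

# pull a few batches directly; the print() calls inside the generator
# show which CSV file each batch is read from
gen = csv_image_generator(0, TRAIN_CSV, BS, lb, mode="train", aug=None)
for _ in range(3):
    X, y = next(gen)
    print(X.shape, y.shape)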

  • You might be interested in the existing [MNIST dataset available in Keras](https://keras.io/datasets/). – nuric Feb 16 '19 at 22:37
  • Please show the definitions of your `trainGen` & `testGen`. Also - is the "very large dataset that does not fit into memory" the MNIST one? – desertnaut Feb 16 '19 at 23:57
  • I added the train and test generators. MNIST is just an example here; the actual dataset is the raw waveforms of Google's speech command dataset, which I want to use for end-to-end speech recognition. Loading it takes too much time, so I split it into smaller files per class. – cevahir parlak Feb 17 '19 at 09:47
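
For reference, the loader nuric mentions works like this, but it reads the full arrays into memory at once, which is exactly what I am trying to avoid with the real dataset (a minimal sketch):

from keras.datasets import mnist

# loads the complete MNIST arrays into memory: (60000, 28, 28) train, (10000, 28, 28) test
(trainX, trainY), (testX, testY) = mnist.load_data()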

0 Answers