
I have a huge number of numpy arrays that do not fit in RAM. Let's say millions of:

np.arange(10) 
  1. I want to save them on the file system in a single file, chunk by chunk.
  2. I want to read them from the file and feed them to my keras model using model.fit_generator

I read about dask, which works with data that does not fit in memory, but I could not manage to achieve my goals with it.

John
  • All the same size, or differing? – hpaulj Mar 07 '19 at 17:27
  • Have you considered HDF5 file, with h5py or pytables module? – kcw78 Mar 07 '19 at 17:36
  • @hpaulj all numpy arrays represent 224x224x3 images, so their size should be the same. @kcw78, I first considered using numpy.savez_compressed, but saw it does not have an append method - I plan to put all the arrays in the same file. I am now looking at hdf5 – John Mar 08 '19 at 07:59
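For reference, a minimal sketch of the HDF5 route mentioned in the comments, assuming h5py and a dataset that is resizable along the first axis so chunks can be appended one at a time (the file name, dataset name, chunk shape, and the produce_chunks stand-in are illustrative, not from the question):

import h5py
import numpy as np

def produce_chunks():
    # Stand-in for however the arrays are actually produced, batch by batch.
    for _ in range(3):
        yield np.zeros((64, 224, 224, 3), dtype="uint8")

with h5py.File("images.h5", "w") as f:
    # Resizable along the first axis so new chunks can be appended.
    dset = f.create_dataset("images", shape=(0, 224, 224, 3),
                            maxshape=(None, 224, 224, 3),
                            dtype="uint8", chunks=(64, 224, 224, 3))
    for chunk in produce_chunks():
        n = chunk.shape[0]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[-n:] = chunk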

1 Answer


Write your data to disk with pickle, one (x, y) chunk per file:

with open(file, "wb") as f:
    pickle.dump((x, y), f, protocol=pickle.HIGHEST_PROTOCOL)
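For completeness, a minimal sketch of how such chunk files might be produced end to end, assuming data_per_file arrays per pickle file (the random stand-in data and file naming are illustrative only):

import pickle
import numpy as np

# Illustrative stand-in data: small random "images" and labels.
all_x = np.random.randint(0, 256, size=(200, 224, 224, 3), dtype=np.uint8)
all_y = np.random.randint(0, 2, size=200)

data_per_file = 50  # arrays per pickle file; an assumption for illustration
files = []
for file_num, start in enumerate(range(0, len(all_x), data_per_file)):
    name = "chunk_%05d.pkl" % file_num
    x = all_x[start:start + data_per_file]
    y = all_y[start:start + data_per_file]
    with open(name, "wb") as f:
        pickle.dump((x, y), f, protocol=pickle.HIGHEST_PROTOCOL)
    files.append(name)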

Then create a list of test and train files and create a generator:

import pickle
import numpy as np

def raw_generator(files, batch_size):
    # Loop forever so Keras can keep pulling batches across epochs.
    while True:
        for file in files:
            try:
                with open(file, 'rb') as f:
                    x, y = pickle.load(f)
                batches = int(np.ceil(len(y) / batch_size))
                for i in range(batches):
                    end = min(len(x), i * batch_size + batch_size)
                    yield x[i * batch_size:end], y[i * batch_size:end]
            except EOFError:
                print("error reading " + file)

train_gen = raw_generator(training_files, batch_size)
test_gen = raw_generator(test_files, batch_size)
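A quick sanity check, assuming batch_size has been set, is to pull one batch by hand and inspect the shapes:

x_batch, y_batch = next(train_gen)
print(x_batch.shape, y_batch.shape)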

Finally call fit_generator:

history = model.fit_generator(
                generator=train_gen,
                steps_per_epoch=(len(training_files) * data_per_file) // batch_size,
                epochs=epochs,
                validation_data=test_gen,
                validation_steps=(len(test_files) * data_per_file) // batch_size,
                use_multiprocessing=False,
                max_queue_size=10,
                workers=1,
                verbose=1)
ixeption
  • You can use whatever serialization you want, it does not really change how to do it. Pickle is fast enough imho – ixeption Mar 08 '19 at 08:38
  • I deleted my comment, sorry, I did not notice your answer. My question was "is pickle fast enough" and above is @ixeption's answer – John Mar 08 '19 at 08:43
  • Please accept answers, if you think the answer is correct – ixeption Mar 08 '19 at 10:44