
I have a huge number of numpy arrays that do not fit in RAM. Let's say millions of:

np.arange(10) 
  1. I want to save them on the file system in a single file, chunk by chunk.
  2. I want to read them from the file and feed them to my keras model using model.fit_generator

I read about dask, which works with data that does not fit in memory, but I could not manage to achieve my goals with it.

John
  • All the same size, or differing? – hpaulj Mar 07 '19 at 17:27
  • Have you considered HDF5 file, with h5py or pytables module? – kcw78 Mar 07 '19 at 17:36
  • @hpaulj all numpy arrays represent 224x224x3 images, so their size should be the same. @kcw78, I first considered using numpy.savez_compressed, but saw it does not have an append method - I plan to put all the arrays in the same file. I am now looking at hdf5 – John Mar 08 '19 at 07:59
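For reference, a minimal sketch of the HDF5 route mentioned in the comments, assuming h5py and a dataset that is resizable along the first axis so chunks can be appended one at a time (the file name, dataset name, chunk shape, and the produce_chunks stand-in are illustrative, not from the question):

import h5py
import numpy as np

def produce_chunks():
    # Stand-in for however the arrays are actually produced, batch by batch.
    for _ in range(3):
        yield np.zeros((64, 224, 224, 3), dtype="uint8")

with h5py.File("images.h5", "w") as f:
    # Resizable along the first axis so new chunks can be appended.
    dset = f.create_dataset("images", shape=(0, 224, 224, 3),
                            maxshape=(None, 224, 224, 3),
                            dtype="uint8", chunks=(64, 224, 224, 3))
    for chunk in produce_chunks():
        n = chunk.shape[0]
        dset.resize(dset.shape[0] + n, axis=0)
        dset[-n:] = chunk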

1 Answer


Write your data to disk with pickle, one (x, y) chunk per file:

with open(file, "wb") as f:
    pickle.dump((x, y), f, protocol=pickle.HIGHEST_PROTOCOL)
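For completeness, a minimal sketch of how such chunk files might be produced end to end, assuming data_per_file arrays per pickle file (the random stand-in data and file naming are illustrative only):

import pickle
import numpy as np

# Illustrative stand-in data: small random "images" and labels.
all_x = np.random.randint(0, 256, size=(200, 224, 224, 3), dtype=np.uint8)
all_y = np.random.randint(0, 2, size=200)

data_per_file = 50  # arrays per pickle file; an assumption for illustration
files = []
for file_num, start in enumerate(range(0, len(all_x), data_per_file)):
    name = "chunk_%05d.pkl" % file_num
    x = all_x[start:start + data_per_file]
    y = all_y[start:start + data_per_file]
    with open(name, "wb") as f:
        pickle.dump((x, y), f, protocol=pickle.HIGHEST_PROTOCOL)
    files.append(name)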

Then create a list of test and train files and create a generator:

import pickle
import numpy as np

def raw_generator(files, batch_size):
    # Loop forever so Keras can keep pulling batches across epochs.
    while True:
        for file in files:
            try:
                with open(file, 'rb') as f:
                    x, y = pickle.load(f)
                batches = int(np.ceil(len(y) / batch_size))
                for i in range(batches):
                    end = min(len(x), i * batch_size + batch_size)
                    yield x[i * batch_size:end], y[i * batch_size:end]
            except EOFError:
                print("error reading " + file)

train_gen = raw_generator(training_files, batch_size)
test_gen = raw_generator(test_files, batch_size)
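A quick sanity check, assuming batch_size has been set, is to pull one batch by hand and inspect the shapes:

x_batch, y_batch = next(train_gen)
print(x_batch.shape, y_batch.shape)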

Finally call fit_generator:

history = model.fit_generator(
                generator=train_gen,
                steps_per_epoch=(len(training_files) * data_per_file) // batch_size,
                epochs=epochs,
                validation_data=test_gen,
                validation_steps=(len(test_files) * data_per_file) // batch_size,
                use_multiprocessing=False,
                max_queue_size=10,
                workers=1,
                verbose=1)
ixeption
  • You can use whatever serialization you want, it does not really change how to do it. Pickle is fast enough imho – ixeption Mar 08 '19 at 08:38
  • I deleted my comment, sorry, I did not notice your answer. My question was "is pickle fast enough" and above is @ixeption's answer – John Mar 08 '19 at 08:43
  • Please accept answers, if you think the answer is correct – ixeption Mar 08 '19 at 10:44