
I have looked into other questions on this problem but could not find an exact answer, so I am trying from scratch:

The problem

I have multiple .npy files (X_train files), each an array of shape (n, 99, 2) - only the first dimension differs between files, while the remaining two are the same. Based on the name of each .npy file I can also get the corresponding labels (y_train files).

Every such pair of files can be loaded into memory easily (as can several files at once), but not all of them together.

I built a generator that goes through the file list and aggregates a given number of files for the training batch:

import os
import numpy as np

def tf_data_generator(filelist, directory, batch_size = 5):
    # directory[0] holds the X_train .npy files, directory[1] the matching y_train files
    i = 0
    while True:
        file_chunk = filelist[i*batch_size:(i+1)*batch_size]
        if not file_chunk:   # reached the end of the file list, start over
            i = 0
            continue
        X_a = []
        Y_a = []
        for fname in file_chunk:
            X_a.append(np.load(os.path.join(directory[0], fname)))
            Y_a.append(np.load(os.path.join(directory[1], fname)))
        yield np.concatenate(X_a), np.concatenate(Y_a)
        i = i + 1
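
For reference, this is roughly how I feed the generator to fit (a sketch only: the paths, file_list and steps names are illustrative, model is the compiled wtte_rnn model shown further down, and steps_per_epoch is simply the number of file chunks per epoch):

import math

file_list = os.listdir('./data/x_train_m')      # matching names exist in ./data/y_train_m
steps = math.ceil(len(file_list) / 5)           # one step per chunk of 5 files

model.fit(tf_data_generator(file_list,
                            directory=['./data/x_train_m', './data/y_train_m'],
                            batch_size=5),
          steps_per_epoch=steps,
          epochs=10)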

In practice (on CPU) it works fine; however, it crashes when I try to use a GPU with CUDA, giving a `Failed to call ThenRnnForward with model config:` error (see: link )

So I am trying another approach and want to use the tf.data API for data generation. However, I am stuck:

def parse_file(name):
    x = np.load('./data/x_train_m/'+name)
    y = np.load('./data/y_train_m/'+name)
    train_dataset = tf.data.Dataset.from_tensor_slices((x, y))
    return train_dataset

train_dataset = parse_file('example1.npy')
train_dataset = train_dataset.shuffle(100).batch(64)

model = wtte_rnn()
model.summary()
K.set_value(model.optimizer.lr, 0.01)
model.fit(train_dataset,
          epochs=10)

This works well. However, I could not find a way to:

  1. mix multiple files (up to a certain number, let's say five)
  2. traverse through the whole list of files

I have read up on flat_map and interleave, but I haven't been able to get any further; every attempt at using them was unsuccessful. How can I build a generator similar to the one in the upper portion of the code, but using the tf.data API? Roughly what I have been trying is sketched below.
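
The sketch (which may well be where I go wrong): the helper names load_npy_pair and file_to_dataset are my own, and the paths are the same placeholders as above. The idea is to turn each file name into a small dataset and let interleave mix up to five files at a time:

import os
import numpy as np
import tensorflow as tf

x_dir, y_dir = './data/x_train_m', './data/y_train_m'
file_names = sorted(os.listdir(x_dir))                # matching names exist in y_dir

def load_npy_pair(name):
    # name arrives as a scalar tf.string tensor inside py_function
    name = name.numpy().decode()
    x = np.load(os.path.join(x_dir, name)).astype(np.float32)
    y = np.load(os.path.join(y_dir, name)).astype(np.float32)
    return x, y

def file_to_dataset(name):
    x, y = tf.py_function(load_npy_pair, [name], [tf.float32, tf.float32])
    x.set_shape([None, 99, 2])                        # each X file has shape (n, 99, 2)
    return tf.data.Dataset.from_tensor_slices((x, y))

train_dataset = (tf.data.Dataset.from_tensor_slices(file_names)
                 .interleave(file_to_dataset, cycle_length=5)   # mix up to five files
                 .shuffle(100)
                 .batch(64))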

codeless

1 Answer


You can try concatenating them, like this:

train_dataset = parse_file('example1.npy') # initialize train dataset

for file in files[1:]: # concatenate with the remaining files
    train_dataset = train_dataset.concatenate(parse_file(file))
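
For completeness, the concatenated dataset can then be shuffled, batched and passed to .fit exactly like the single-file version in the question; a minimal sketch (wtte_rnn is the model from the question):

train_dataset = train_dataset.shuffle(100).batch(64)

model = wtte_rnn()
model.fit(train_dataset, epochs=10)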
Nicolas Gervais
  • That would make sense if I could load all the data in memory - however, I cannot do it all at once. What I'm trying to do here is use the tf.data API to generate data and feed it to the .fit method. – codeless Feb 08 '21 at 16:08
  • `tf.data.Dataset` doesn't load everything in memory – Nicolas Gervais Feb 08 '21 at 16:12
  • It might not, but in a loop like the one from the answer, psutil shows that memory usage is indeed increasing. – codeless Feb 08 '21 at 16:34
  • Some more experimentation: just by running the code shown above, the only way to free up memory is to actually delete train_dataset, which defeats the purpose in the first place. Not sure why this is happening. – codeless Feb 08 '21 at 17:59
  • The docs don't mention an iterative way of loading data from `.npy` https://www.tensorflow.org/tutorials/load_data/numpy – Nicolas Gervais Feb 08 '21 at 18:00