
I'm trying to load a dataset, stored in two .npy files (for features and ground truth) on my drive, and use it to train a neural network.

print("loading features...")
data = np.load("[...]/features.npy")

print("loading labels...")
labels = np.load("[...]/groundtruth.npy") / 255

dataset = tf.data.Dataset.from_tensor_slices((data, labels))

This throws the following error when from_tensor_slices() is called:

tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

The ground-truth file is larger than 2.44 GB, so I run into problems when creating a Dataset from it (see warnings here and here).

Possible solutions I found were either for TensorFlow 1.x (here and here, while I am running version 2.6) or suggest using numpy's memmap (here), which I unfortunately couldn't get to work; I also wonder whether that would slow down the computation.

I'd appreciate your help, thanks!

    I ended up splitting my dataset into two parts and reading it that way, but your recommendation helped me understand the underlying problem and think outside the box. I'll mark it as the answer, thank you again :) – babrs Nov 19 '21 at 14:30

2 Answers


You need some kind of data generator, because your data is way too big to fit directly into tf.data.Dataset.from_tensor_slices. I don't have your dataset, but here's an example of how you could get data batches and train your model inside a custom training loop. The data is an NPZ NumPy archive from here:

import random

import numpy as np

def load_data(file='dsprites_ndarray_co1sh3sc6or40x32y32_64x64.npz'):
    dataset_zip = np.load(file, encoding='latin1')

    images = dataset_zip['imgs']
    latents_classes = dataset_zip['latents_classes']

    return images, latents_classes

def get_batch(indices, train_images, train_categories):
    # Gather the images and the shape labels for the given batch indices.
    shapes_as_categories = np.array([train_categories[i][1] for i in indices])
    images = np.array([train_images[i] for i in indices])

    return [images.reshape((images.shape[0], 64, 64, 1)).astype('float32'), shapes_as_categories.reshape(
        shapes_as_categories.shape[0], 1).astype('float32')]

# Load your data once
train_images, train_categories = load_data()
indices = list(range(train_images.shape[0]))
random.shuffle(indices)

epochs = 2000
batch_size = 256
total_batch = train_images.shape[0] // batch_size

for epoch in range(epochs):
    for i in range(total_batch):
        batch_indices = indices[batch_size * i: batch_size * (i + 1)]
        batch = get_batch(batch_indices, train_images, train_categories)
        ...
        ...
        # Train your model with this batch.
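
For illustration, here is a minimal sketch of what the training step inside that loop could look like; the model, optimizer, and loss below are placeholders and not part of the original answer:

import tensorflow as tf

# Hypothetical model/optimizer/loss; replace with your own setup.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(64, 64, 1)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(batch):
    images, targets = batch
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_fn(targets, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss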
  • Thanks for your quick answer, it's actually training now... However, my RAM (32 GB) is almost completely full, slowing down training, even though features and labels combined take up far less than 3 GB of disk space. Can you think of a reason for this? – babrs Nov 16 '21 at 22:58
  • How big is your batch size? – AloneTogether Nov 17 '21 at 06:25
  • I'm currently training with a batch size of 64, where each feature vector is a one dimensional array of bools with 96 entries and each label vector is a one dimensional array of 640 uint8. – babrs Nov 17 '21 at 12:29
  • You might have to lower the batch size, but it is hard to say what exactly the reason is. I just wanted to point you in the right direction – AloneTogether Nov 17 '21 at 12:32

The accepted answer (https://stackoverflow.com/a/69994287) loads the whole dataset into memory with the load_data function. That's why your RAM fills up completely.

You can instead pack the data into an .npz archive in which each feature is its own .npy entry, then write a generator that loads the entries on demand and wrap it with tf.data.Dataset.from_generator as in 1, or build a data generator with Keras as in 2; a rough sketch of the generator route follows below.
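
Here is a rough, hypothetical sketch of that generator route. The archive layout (one entry per sample, named x_i/y_i), the path, the sample count, and the 96/640 shapes (taken from the comments under the first answer) are all placeholder assumptions, not part of the original answer:

import numpy as np
import tensorflow as tf

def npz_generator(path="dataset.npz", n_samples=1000):
    # Entries of an .npz archive are only decompressed when accessed,
    # so the whole dataset never has to sit in RAM at once.
    archive = np.load(path)
    for i in range(n_samples):
        x = archive[f"x_{i}"].astype("float32")          # one feature vector per entry
        y = archive[f"y_{i}"].astype("float32") / 255    # one label vector per entry
        yield x, y

dataset = tf.data.Dataset.from_generator(
    npz_generator,
    output_signature=(
        tf.TensorSpec(shape=(96,), dtype=tf.float32),
        tf.TensorSpec(shape=(640,), dtype=tf.float32),
    ),
).batch(64).prefetch(tf.data.AUTOTUNE)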

Or use numpy's mmap_mode when loading, so you can stick to your single .npy feature file, as in 3; see the memmap sketch below.
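
For comparison, a minimal memmap-based sketch under the same placeholder assumptions about paths and shapes:

import numpy as np
import tensorflow as tf

features = np.load("features.npy", mmap_mode="r")        # memory-mapped, stays on disk
groundtruth = np.load("groundtruth.npy", mmap_mode="r")

def mmap_batches(batch_size=64):
    for start in range(0, len(features), batch_size):
        # Only the slice touched here is actually read into RAM.
        x = np.asarray(features[start:start + batch_size], dtype="float32")
        y = np.asarray(groundtruth[start:start + batch_size], dtype="float32") / 255
        yield x, y

dataset = tf.data.Dataset.from_generator(
    mmap_batches,
    output_signature=(
        tf.TensorSpec(shape=(None, 96), dtype=tf.float32),
        tf.TensorSpec(shape=(None, 640), dtype=tf.float32),
    ),
)

Reading from a memmap does add disk I/O per batch, but chaining .prefetch(tf.data.AUTOTUNE) usually lets that I/O overlap with training.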