
I have a numpy array that I want to write to a TFRecord file. The dimensions of the array, for both the input X and the label y, are [200, 46, 72, 72]. For training my model I want to read the TFRecord file and get slices of shape [72, 72] for both the input and the label data.

I tried to apply the following Stack Overflow answer:

The problem is that this method is really slow, probably because of the number of elements being looped over (200 * 46 = 9200). When I write the entire numpy array as a bytes feature instead of a FloatList I don't have this problem, but then I don't understand how to get [72, 72] slices for each batch.

import tensorflow as tf


def npy_to_tfrecords(X, y):
    # write one record per [72,72] slice to a tfrecords file
    output_file = 'E:\\Documents\\Datasets\\tfrecordtest\\test.tfrecord'
    writer = tf.python_io.TFRecordWriter(output_file)

    # loop over the first two dimensions, so every Example holds one [72,72] slice
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            print(f"{i},{j}")

            # Feature contains a map of string to Feature proto objects
            feature = {}
            feature['X'] = tf.train.Feature(float_list=tf.train.FloatList(value=X[i, j, :, :].flatten()))
            feature['y'] = tf.train.Feature(float_list=tf.train.FloatList(value=y[i, j, :, :].flatten()))

            # construct the Example proto object
            example = tf.train.Example(features=tf.train.Features(feature=feature))

            # serialize the example to a string and write it to disk
            writer.write(example.SerializeToString())

    writer.close()
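
For comparison, the bytes-based variant mentioned above, which writes each full [46,72,72] sample as a single bytes feature, looks roughly like this (the function name and the float32 cast are just how I happened to set it up):

import numpy as np


def npy_to_tfrecords_bytes(X, y):
    # write one record per sample, with the whole [46,72,72] block stored as raw bytes
    output_file = 'E:\\Documents\\Datasets\\tfrecordtest\\test_bytes.tfrecord'
    writer = tf.python_io.TFRecordWriter(output_file)

    for i in range(X.shape[0]):
        feature = {
            'X': tf.train.Feature(bytes_list=tf.train.BytesList(value=[X[i].astype(np.float32).tostring()])),
            'y': tf.train.Feature(bytes_list=tf.train.BytesList(value=[y[i].astype(np.float32).tostring()])),
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

    writer.close()

This version only loops 200 times instead of 9200 and finishes quickly.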

For reading I use roughly the following code:

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(_parse_function, num_parallel_calls=6)
# shuffle_and_repeat returns a new dataset, so the result has to be assigned back
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(SHUFFLE_BUFFER))
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
input_data, label_data = iterator.get_next()
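
For the per-slice float records, my _parse_function is roughly the following; it reads the flattened FloatList back and reshapes it to [72,72]:

def _parse_function(serialized_example):
    # each record holds one flattened [72,72] slice for X and one for y
    features = {
        'X': tf.FixedLenFeature([72 * 72], tf.float32),
        'y': tf.FixedLenFeature([72 * 72], tf.float32),
    }
    parsed = tf.parse_single_example(serialized_example, features)
    X = tf.reshape(parsed['X'], [72, 72])
    y = tf.reshape(parsed['y'], [72, 72])
    return X, y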

When I save the numpy arrays as bytes, the _parse_function returns the whole array, and I cannot figure out how to write a parse function that returns slices.
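
To be concrete, the bytes version of the parse function I have looks roughly like this (the name is just what I call it locally); it decodes the raw bytes back into the full [46,72,72] block per record, not the [72,72] slices I actually want to batch:

def _parse_function_bytes(serialized_example):
    # decodes one record into the full [46,72,72] arrays, not the [72,72] slices I want
    features = {
        'X': tf.FixedLenFeature([], tf.string),
        'y': tf.FixedLenFeature([], tf.string),
    }
    parsed = tf.parse_single_example(serialized_example, features)
    X = tf.reshape(tf.decode_raw(parsed['X'], tf.float32), [46, 72, 72])
    y = tf.reshape(tf.decode_raw(parsed['y'], tf.float32), [46, 72, 72])
    return X, y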

Summary:

  • save 2 numpy arrays to a TFRecord file
  • read the TFRecord file and obtain [72,72] slices of the saved arrays in the batches used for the model