
My question is about how to get batch inputs from multiple (or sharded) tfrecords. I've read the example https://github.com/tensorflow/models/blob/master/inception/inception/image_processing.py#L410. The basic pipeline is, taking the training set as an example: (1) first generate a series of tfrecords (e.g., train-000-of-005, train-001-of-005, ...), (2) build a list of these filenames and feed it into tf.train.string_input_producer to get a filename queue, (3) simultaneously create a tf.RandomShuffleQueue to shuffle the serialized examples, (4) use tf.train.batch_join to generate batch inputs.
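For reference, a minimal sketch of the queue-based pipeline described above, assuming TF 1.x (these queue APIs were removed in TF 2) and the 32x32x3 float feature layout used in the answer below; tf.train.shuffle_batch stands in here for the explicit tf.RandomShuffleQueue plus tf.train.batch_join combination:

import tensorflow as tf  # assumes TF 1.x

# (1) the shard filenames
filenames = ['train-%03d-of-005' % i for i in range(5)]
# (2) a queue of filenames; read one serialized Example at a time
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized_example,
    {'X': tf.FixedLenFeature((32 * 32 * 3,), tf.float32),
     'y': tf.FixedLenFeature((), tf.int64)})
# (3)+(4) shuffle_batch wraps a RandomShuffleQueue and dequeues whole batches
X_batch, y_batch = tf.train.shuffle_batch(
    [features['X'], features['y']], batch_size=32,
    capacity=2000, min_after_dequeue=1000)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess, coord=coord)
    xs, ys = sess.run([X_batch, y_batch])
    coord.request_stop()
    coord.join(threads)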

I think this is complex, and I'm not sure about the logic of this procedure. In my case, I have a list of .npy files, and I want to generate sharded tfrecords from them (multiple separate tfrecord files, not just one single large file). Each of these .npy files contains a different number of positive and negative samples (2 classes). A basic method is to generate one single large tfrecord file, but that file is too large (~20 GB), so I resort to sharded tfrecords. Is there a simpler way to do this?

– mining

1 Answer


The whole process is simplified using the Dataset API. Here are both parts: (1) convert the numpy arrays to tfrecords and (2) read the tfrecords to generate batches.

1. Creation of tfrecords from a numpy array:

Example arrays:
import numpy as np
import tensorflow as tf

inputs = np.random.normal(size=(5, 32, 32, 3))
labels = np.random.randint(0, 2, size=(5,))

def npy_to_tfrecords(inputs, labels, filename):
    with tf.io.TFRecordWriter(filename) as writer:
        for X, y in zip(inputs, labels):
            # Feature contains a map of string to feature proto objects
            feature = {}
            feature['X'] = tf.train.Feature(float_list=tf.train.FloatList(value=X.flatten()))
            feature['y'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[y]))

            # Construct the Example proto object
            example = tf.train.Example(features=tf.train.Features(feature=feature))

            # Serialize the example to a string
            serialized = example.SerializeToString()

            # Write the serialized object to disk
            writer.write(serialized)


npy_to_tfrecords(inputs, labels, 'numpy.tfrecord')
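Since the question asks for sharded records, here is a minimal sketch of one way to split the arrays across several files by reusing npy_to_tfrecords above (the helper name, the round-robin slicing, and the filename pattern are illustrative choices, not a fixed API):

def npy_to_sharded_tfrecords(inputs, labels, basename, num_shards):
    # Hypothetical helper: round-robin the examples across num_shards files,
    # named like train-00000-of-00005.tfrecord.
    for shard in range(num_shards):
        filename = '%s-%05d-of-%05d.tfrecord' % (basename, shard, num_shards)
        npy_to_tfrecords(inputs[shard::num_shards], labels[shard::num_shards], filename)

npy_to_sharded_tfrecords(inputs, labels, 'train', num_shards=5)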

2. Read the tfrecords using the Dataset API:

filenames = ['numpy.tfrecord']
dataset = tf.data.TFRecordDataset(filenames)
# tf.data.TFRecordDataset is available in version 1.5 and above

# example proto decode
def _parse_function(example_proto):
    # FixedLenFeature with an explicit shape restores the flattened
    # float list written above back to a (32, 32, 3) tensor
    keys_to_features = {'X': tf.io.FixedLenFeature(shape=(32, 32, 3), dtype=tf.float32),
                        'y': tf.io.FixedLenFeature((), tf.int64, default_value=0)}
    parsed_features = tf.io.parse_single_example(example_proto, keys_to_features)
    return parsed_features['X'], parsed_features['y']

# Parse the record into tensors.
dataset = dataset.map(_parse_function)

# Generate batches
dataset = dataset.batch(5)
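If the GPU sits idle because input loading is the bottleneck (see the first comment below), the usual tf.data additions are shuffle, parallel map, and prefetch; a sketch with arbitrary buffer sizes (tf.data.AUTOTUNE is tf.data.experimental.AUTOTUNE before TF 2.4):

dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.shuffle(buffer_size=1000)  # plays the role of min_after_dequeue
dataset = dataset.map(_parse_function, num_parallel_calls=tf.data.AUTOTUNE)  # cf. num_threads
dataset = dataset.batch(5)
dataset = dataset.prefetch(tf.data.AUTOTUNE)  # keeps batches ready, cf. capacity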

Check that the generated batches are correct:

for data in dataset:
    break
np.testing.assert_allclose(inputs[0], data[0][0])
np.testing.assert_allclose(labels[0], data[1][0])
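To read a set of sharded files back, tf.data.TFRecordDataset also accepts a list of filenames directly; a sketch using tf.data.Dataset.list_files and interleave (the glob pattern assumes shard names like train-00000-of-00005.tfrecord):

filenames = tf.data.Dataset.list_files('train-*.tfrecord', shuffle=True)
dataset = filenames.interleave(tf.data.TFRecordDataset, cycle_length=4)
dataset = dataset.map(_parse_function).batch(5)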
– Vijay Mariappan
  • Hi, sir, does this API support something like the `num_threads` or `capacity` arguments of the `tf.train.shuffle_batch` API? In my case, if the network is small, the execution on the GPU is faster than the data loading, which leads to idle GPU time, so I want the queue that fetches data to always be full. Thanks. – mining Aug 10 '17 at 21:37
  • Thanks very much! – mining Aug 10 '17 at 23:34
  • Thanks for this nice example - using `reader = tf.TFRecordReader(); key, value = reader.read(filename_queue)` I get a key, value pair back (value corresponds to `example_proto` in your code). How can I get the key using the `dataset = tf.contrib.data.TFRecordDataset(filenames)` ? – Mr_and_Mrs_D Sep 30 '17 at 16:27
  • is it possible to store "shapeofnparray" in the TFRecord and then reshape using it similar to https://stackoverflow.com/a/42603692/2184122 ? I can't map between the old and the dataset way. – Robert Lugg Jan 09 '18 at 19:27
  • What exactly is `example_proto`? A string or byte data? where is that variable assigned? what is it assigned to? – Uchiha Madara Jul 16 '18 at 11:11
  • Thank you very much, I have some questions. Why did you flatten X? My numpy arrays are image arrays and I have 51 outputs for y. Do I also need to flatten them? moreover, when I try this code, ram goes as high as 90% (I have 32GB RAM) and the program crashes. Can you identify the problem? – Amin Marshal Dec 06 '19 at 17:35
  • I tried this, but I am getting `TypeError: array([ 3., 9., 3., ..., 17., 8., 17.], dtype=float32) has type numpy.ndarray, but expected one of: int, long, float ` error on the statement `example = tf.train.Example(features=tf.train.Features(feature=feature))` – Shantanu Shinde Oct 08 '20 at 00:21