
My dataset consists of several directories, and each directory corresponds to one class. Each directory contains a different number of .tfrecord files. My question is: how can I sample 5 images (each .tfrecord file corresponds to one image) from each directory? My other question is: how can I sample 5 of these directories and then sample 5 images from each?

I just want to do it with `tf.data.Dataset`. So I want a dataset from which I get an iterator, and `iterator.next()` gives me a batch of 25 images containing 5 samples from each of 5 classes.

Siavash
  • This may sound silly, but since you need exactly 5 images from each class, why not create 5 `tf.data.Dataset` instances, each with a `batch_size` of 5? Otherwise, `tf.data.TFRecordDataset` can accept a list of strings as input, but you have less control over the sampling process. – Richard_wth May 20 '18 at 02:26
  • Then if I want to do another experiment with 6 samples, I have to create the files again, and the same thing happens for 10 samples, etc. – Siavash May 22 '18 at 20:13

2 Answers


EDIT: If the number of classes is greater than 5, then you can use the new tf.contrib.data.sample_from_datasets() API (currently available in tf-nightly and will be available in TensorFlow 1.9).

directories = ["class_0/*", "class_1/*", "class_2/*", "class_3/*", ...]

CLASSES_PER_BATCH = 5
EXAMPLES_PER_CLASS_PER_BATCH = 5
BATCH_SIZE = CLASSES_PER_BATCH * EXAMPLES_PER_CLASS_PER_BATCH
NUM_CLASSES = len(directories)


# Build one dataset per class.
per_class_datasets = [
    tf.data.TFRecordDataset(tf.data.Dataset.list_files(d)) for d in directories]

# Next, build a dataset where each element is a vector of 5 classes to be chosen
# for a particular batch.
classes_per_batch_dataset = tf.contrib.data.Counter().map(
    lambda _: tf.random_shuffle(tf.range(NUM_CLASSES))[:CLASSES_PER_BATCH])

# Transform the dataset of per-batch class vectors into a dataset with one
# one-hot element per example (i.e. 25 examples per batch).
class_dataset = classes_per_batch_dataset.flat_map(
    lambda classes: tf.data.Dataset.from_tensor_slices(
        tf.one_hot(classes, NUM_CLASSES)).repeat(EXAMPLES_PER_CLASS_PER_BATCH))

# Use `tf.contrib.data.sample_from_datasets()` to select an example from the
# appropriate dataset in `per_class_datasets`.
example_dataset = tf.contrib.data.sample_from_datasets(per_class_datasets,
                                                       class_dataset)

# Finally, combine 25 consecutive examples into a batch.
result = example_dataset.batch(BATCH_SIZE)
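
The resulting dataset can then be consumed like any other tf.data pipeline. A minimal usage sketch (note that the elements are still serialized records at this point, which you would parse with your own function):

iterator = result.make_one_shot_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    # Each run yields a batch of 25 serialized records:
    # 5 examples from each of 5 randomly chosen classes.
    batch = sess.run(next_batch)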

If you have exactly 5 classes, you can define a nested dataset for each directory and combine them using Dataset.interleave():

# NOTE: We're assuming that the 0th directory contains elements from class 0, etc.
directories = ["class_0/*", "class_1/*", "class_2/*", "class_3/*", "class_4/*"]
directories = tf.data.Dataset.from_tensor_slices(directories)
directories = directories.apply(tf.contrib.data.enumerate_dataset())    

# Define a function that maps each (class, directory) pair to the (shuffled)
# records in those files.
def per_directory_dataset(class_label, directory_glob):
  files = tf.data.Dataset.list_files(directory_glob, shuffle=True)
  records = tf.data.TFRecordDataset(files)
  # Zip the records with their class. 
  # NOTE: This part might not be necessary if the records contain information about
  # their class that can be parsed from them.
  return tf.data.Dataset.zip(
      (records, tf.data.Dataset.from_tensors(class_label).repeat(None)))

# NOTE: The `cycle_length` and `block_length` here aren't strictly necessary,
# because the batch size is exactly `number of classes * images per class`.
# However, these arguments may be useful if you want to decouple these numbers.
merged_records = directories.interleave(per_directory_dataset,
                                        cycle_length=5, block_length=5)
merged_records = merged_records.batch(25)
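
If you also need to decode the records into images, a map() can be applied before batching. A rough sketch, where the "image_raw" feature name and the 32x32 shape are assumptions about how your records were written:

def parse_example(serialized, class_label):
  # NOTE: Adapt this feature spec and reshaping to your own files.
  features = tf.parse_single_example(
      serialized, {"image_raw": tf.FixedLenFeature([], tf.string)})
  image = tf.reshape(tf.decode_raw(features["image_raw"], tf.uint8), [32, 32])
  return image, class_label

merged_records = directories.interleave(per_directory_dataset,
                                        cycle_length=5, block_length=5)
merged_records = merged_records.map(parse_example).batch(25)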
mrry
  • That does look more elegant than my take on it. :) I'm wondering however: would this work with `num_classes > 5`? I couldn't find a way to use `Dataset.interleave()` to pick elements of exactly 5 classes per batch in that case... – benjaminplanche May 21 '18 at 18:16
  • That would depend on what sort of mix you wanted in the resulting batches. One option would be to set `cycle_length=num_classes`, and try tweaking `block_length`, but those would give you a deterministic mix, which might not be desirable. In TF 1.9 (and the current nightlies) you could use `tf.contrib.data.sample_from_datasets()`, which lets you sample randomly from a list of input datasets according to a specific weight distribution, and would give more control, especially if the weights are themselves a dataset of distributions indicating what class to pick. – mrry May 21 '18 at 18:21
  • I just gave your code a try. As it is, it seems to generate batches only from the first 5 classes until they run out, before sampling from the next ones. But yes, I guess it depends what kind of mix the OP wants. I didn't know about `tf.contrib.data.sample_from_datasets()`, which seems like quite a useful function. Thanks for sharing! – benjaminplanche May 21 '18 at 18:30
  • I tried your code. It seems like I always get the same class samples at each iterator.next(). What I want is to get 5 different classes each time I call iterator.next(). – Siavash May 23 '18 at 17:21
  • @Siavash I think I understand your question better now... please see the updated version. – mrry May 23 '18 at 20:41
  • @mrry Thank you for your response. Can you please tell me where can I find more information about tf.contrib.data.sample_from_datasets? – Siavash May 24 '18 at 20:12
  • It looks like the docs on the website have not been regenerated recently, but here's a [link to the docstring](https://github.com/tensorflow/tensorflow/blob/28340a4b12e286fe14bb7ac08aebe325c3e150b4/tensorflow/contrib/data/python/ops/interleave_ops.py#L198). – mrry May 25 '18 at 15:30

Please find below a potential solution.

For the sake of demonstration, I am using a Python generator instead of TFRecords as input (I am assuming you know how to use a TF Dataset to read and parse the files in each folder; other threads cover this, e.g. here).
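
For reference, a rough sketch of what such a per-class TFRecord pipeline could look like, where the glob pattern and parse_fn are placeholders for your own paths and parsing function:

import tensorflow as tf

def get_class_tfrecord_dataset(class_id, directory_glob, parse_fn):
    # Hypothetical helper: list this class's .tfrecord files, parse each
    # record into an image, and pair it with its class label.
    files = tf.data.Dataset.list_files(directory_glob, shuffle=True)
    return (tf.data.TFRecordDataset(files)
            .map(parse_fn)
            .map(lambda image: (image, class_id)))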

import tensorflow as tf
import numpy as np

def get_class_generator(class_id, num_el, el_shape=(32, 32), el_dtype=np.int32):
    """ Returns a dummy generator, 
        outputting "num_el" elements of a single class (input data & class label) 
    """
    def class_generator():
        for x in range(num_el):
            element = np.ones(el_shape, dtype=el_dtype) * x
            yield element, class_id
    return class_generator


def concatenate_datasets(datasets):
    """ Concatenate a list of datasets together.
        Snippet by user2781994 (https://stackoverflow.com/a/49069420/624547)
    """
    ds0 = tf.data.Dataset.from_tensors(datasets[0])
    for ds1 in datasets[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(ds1))
    return ds0


num_classes = 11
class_batch_size = 3
num_classes_per_batch = 5
# note: using 3 instead of 5 for class_batch_size in this example 
#       just to distinguish between the 2 vars.

# Initializing per-class datasets:
# (note: replace tf.data.Dataset.from_generator(...) to suit your use-case,
#        e.g. tf.data.TFRecordDataset(glob.glob(perclass_tfrecords_path))
#                    .map(your_parsing_function))
class_datasets = [tf.data.Dataset
                 .from_generator(get_class_generator(
                      class_id, num_el=np.random.randint(1, 60) 
                      # ^ simulating unequal number of samples per class
                      ), (tf.int32, tf.int32), ([32, 32], []))
                 .repeat(-1)
                 .batch(class_batch_size)
                  for class_id in range(num_classes)]

# Initializing complete dataset:
dataset = (tf.data.Dataset
           # Concatenating all the class datasets together:
           .zip(tuple(class_datasets))
           .flat_map(lambda *args: concatenate_datasets(args))
           # Shuffling the class datasets:
           .shuffle(buffer_size=num_classes)
           # Flattening batches from shape (num_classes_per_batch, class_batch_size, ...)
           # into (num_classes_per_batch * class_batch_size, ...):
           .flat_map(lambda *args: tf.data.Dataset.from_tensor_slices(args))
           # Returning correct number of el. (num_classes_per_batch * class_batch_size):
           .batch(num_classes_per_batch * class_batch_size))

# Visualizing results:
next_batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for i in range(10):
        batch = sess.run(next_batch)
        print(">> batch {}".format(i))
        print("- inputs shape: {} ; label shape: {}".format(batch[0].shape,batch[1].shape))
        print("- class values: {}".format(batch[1]))

Outputs:

>> batch 0
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [ 1  1  1  0  0  0 10 10 10  2  2  2  9  9  9]
>> batch 1
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [0 0 0 2 2 2 3 3 3 5 5 5 6 6 6]
>> batch 2
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [ 9  9  9  8  8  8  4  4  4  3  3  3 10 10 10]
>> batch 3
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [7 7 7 8 8 8 6 6 6 6 6 6 2 2 2]
>> batch 4
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [1 1 1 0 0 0 1 1 1 8 8 8 5 5 5]
>> batch 5
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [2 2 2 4 4 4 9 9 9 5 5 5 5 5 5]
>> batch 6
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [0 0 0 7 7 7 3 3 3 9 9 9 7 7 7]
>> batch 7
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [10 10 10 10 10 10  1  1  1  6  6  6  7  7  7]
>> batch 8
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [4 4 4 3 3 3 5 5 5 6 6 6 3 3 3]
>> batch 9
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [8 8 8 9 9 9 2 2 2 8 8 8 0 0 0]
benjaminplanche
  • For batch 5: - inputs shape: (15, 32, 32) ; label shape: (15,) - class values: [2 2 2 4 4 4 9 9 9 5 5 5 5 5 5]. This is not what I want. I want to get the same amount of samples from each class. Here we have 6 samples from class 5. – Siavash May 23 '18 at 21:23
  • Yes; with this solution, one class may appear twice in a batch (hence the `2 * 3` samples for class `5`). @mrry's solution may avoid that. – benjaminplanche May 23 '18 at 21:51