
Consider this problem: sample a random number of images from a randomly chosen class in an image dataset (like ImageNet) as an input element for a TensorFlow graph that functions as an object-set recognizer. Within each batch, every class has the same number of samples to simplify computation, but different batches may use a different number of images per class, e.g. batch_0: num_imgs_per_cls=2; batch_1000: num_imgs_per_cls=3.

If TensorFlow has existing functionality for this, an explanation of the whole process from scratch (e.g. starting from directories of images) would be really appreciated.

ArthurSeat

1 Answer


There is a very similar answer by @mrry here.

Sampling balanced batches

In face recognition we often use triplet loss (or similar losses) to train the model. The usual way to sample triplets to compute the loss is to create a balanced batch of images where we have for instance 10 different classes (i.e. 10 different people) with 5 images each. This gives a total batch size of 50 in this example.

More generally the problem is to sample num_classes_per_batch (10 in the example) classes, and then sample num_images_per_class (5 in the example) images for each class. The total batch size is:

batch_size = num_classes_per_batch * num_images_per_class
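With the numbers from the example, this works out as follows (variable names as used throughout the answer):

num_classes_per_batch = 10  # 10 different people per batch
num_images_per_class = 5    # 5 images per person
batch_size = num_classes_per_batch * num_images_per_class  # == 50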

Have one dataset for each class

The easiest way to deal with a lot of different classes (100,000 in MS-Celeb) is to create one dataset for each class.
For instance, you can have one tfrecords file per class and create the datasets like this:

# Build one dataset per class (one tfrecords file per class).
filenames = ["class_0.tfrecords", "class_1.tfrecords", ...]  # one file per class
datasets = [tf.data.TFRecordDataset(f).repeat(None) for f in filenames]
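
Since the question asks for the process from scratch, here is a minimal sketch of how such per-class tfrecords files could be produced from directories of images. This is not part of the original answer: the directory layout, the helper name write_class_tfrecord and the "image" feature key are all assumptions.

import os
import tensorflow as tf

def write_class_tfrecord(class_dir, out_path):
    # Serialize every image file in `class_dir` into one tfrecords file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        for name in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, name), "rb") as f:
                image_bytes = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
            }))
            writer.write(example.SerializeToString())

# e.g. write_class_tfrecord("images/class_0", "class_0.tfrecords")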

Sample from the datasets

Now we would like to be able to sample from these datasets. For instance we want the following labels in our batch:

1 1 1 3 3 3 9 9 9 4 4 4

This corresponds to num_classes_per_batch=4 and num_images_per_class=3.

To do this we will need to use features that will be released in r1.9. The function should be called tf.contrib.data.choose_from_datasets (see here for a discussion on this).
It should look like:

def choose_from_datasets(datasets, selector):
    """Chooses elements from the datasets in `datasets` according to the indices yielded by `selector`."""

So we create a selector that will output 1 1 1 3 3 3 9 9 9 4 4 4 and combine it with the datasets to obtain our final dataset, which will output balanced batches:

def generator(_):
    # Sample `num_classes_per_batch` distinct class indices for this batch
    sampled = tf.random_shuffle(tf.range(num_classes))[:num_classes_per_batch]
    # Repeat each sampled class index `num_images_per_class` times
    batch_labels = tf.tile(tf.expand_dims(sampled, -1), [1, num_images_per_class])
    return tf.to_int64(tf.reshape(batch_labels, [-1]))

# Map over an infinite counter to produce a fresh set of class labels for
# every batch, then unbatch so the selector yields one class index at a time.
selector = tf.contrib.data.Counter().map(generator)
selector = selector.apply(tf.contrib.data.unbatch())

dataset = tf.contrib.data.choose_from_datasets(datasets, selector)

# Batch
batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
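
Note that the elements coming out of the tfrecords datasets are still serialized strings; in practice a parsing map would go just before the .batch call above. A minimal sketch, assuming the "image" bytes feature from the writing sketch earlier and an arbitrary 112x112 target size:

def parse_example(serialized):
    features = tf.parse_single_example(
        serialized, {"image": tf.FixedLenFeature([], tf.string)})
    image = tf.image.decode_jpeg(features["image"], channels=3)
    # Resize so every element has the same static shape before batching.
    image = tf.image.resize_images(image, [112, 112])
    return image

# dataset = dataset.map(parse_example)  # goes before .batch(batch_size)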

Until r1.9 is released, you can test this with the nightly TensorFlow build, using DirectedInterleaveDataset as a workaround:

# The workaround that works on current nightly builds:
from tensorflow.contrib.data.python.ops.interleave_ops import DirectedInterleaveDataset
dataset = DirectedInterleaveDataset(selector, datasets)
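
As a quick smoke test, with small integer datasets standing in for the per-class image datasets (the same toy setup used in the comments below):

num_classes = 10
num_classes_per_batch = 4
num_images_per_class = 3
datasets = [tf.data.Dataset.range(x, x + 10).repeat(None) for x in range(0, 100, 10)]
# ... build `selector` and `dataset` exactly as above ...
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
with tf.Session() as sess:
    print(sess.run(next_batch))  # e.g. [70 71 72 20 21 22 10 11 12 80 81 82]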

I also wrote about this workaround here.

Olivier Moindrot
  • I just gave your code a try. I simply made `datasets` by `datasets = [tf.data.Dataset.range(x, x+10) for x in range(0,100,10)]`. Other params were set as follows: `num_classes = 10`,`num_classes_per_batch = 4`,`num_images_per_class = 3`. However, the output was not as expected: `[70 71 72 20 21 22 10 11 12 80 81 82],[30 31 32 50 51 52 73 74 75 40 41 42],[76 77 78 0 1 2 53 54 55 60 61 62],[90 91 92 13 14 15 23 24 25 79 63 64],[65 16 17 18 26 27 28 83 84 85 43 44],[45 93 94 95 3 4 5 19 66 67 68 86],[87 88 96 97 98 6 7 8 46 47 48 99],[ 9 89 29 49 33 34 35 36 37 38 56 57],[58 59 69 39]` – ArthurSeat May 29 '18 at 14:28
  • One tiny nit: in code review we decided to change the name `select_from_datasets` to `choose_from_datasets`, but the functionality is the same. It has been submitted internally and should appear in the next merge. – mrry May 29 '18 at 14:42
  • Thanks for the update @mrry, I hope the syntax `def choose_from_datasets(datasets, selector)` is also correct :) – Olivier Moindrot May 29 '18 at 15:04
  • @ArthurSeat: the results look good to me. At the end you run out of examples (because each dataset only has 10 examples) so I'm going to add a `repeat(None)` statement at the beginning to make sure that there is always data available. We should also shuffle the input datasets before using them. – Olivier Moindrot May 29 '18 at 15:06
  • @OlivierMoindrot @mrry It threw a `ResourceExhaustedError` after the model ran about 100 steps: `tensorflow.python.framework.errors_impl.ResourceExhaustedError: /data1/msceleb1m_per_class_test/168.tfrecords; Too many open files [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[[?,112,112,?], [?]], output_types=[DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]` Is it because `num_classes` is too large? In my case it was 1000. – ArthurSeat Jun 18 '18 at 03:26