I'm trying to train a deep learning model in PyTorch on images that have been bucketed to particular dimensions. I'd like to train my model using mini-batches, but the mini-batch size does not neatly divide the number of examples in each bucket.

One solution I saw in a previous post was to pad the images with additional whitespace (either on the fly or all at once at the beginning of training), but I do not want to do this. Instead, I would like to allow the batch size to be flexible during training.

Specifically, if N is the number of images in a bucket and B is the batch size, then for that bucket I would like to get N // B batches if B divides N, and N // B + 1 batches otherwise. The last batch can have fewer than B examples.

As an example, suppose I have indexes [0, 1, ..., 19] (inclusive) and I'd like to use a batch size of 3.

The indexes [0, 9] correspond to images in bucket 0 (shape (C, W1, H1))
The indexes [10, 19] correspond to images in bucket 1 (shape (C, W2, H2))

(The channel depth is the same for all images.) Then an acceptable partitioning of the indexes would be

batches = [
    [0, 1, 2], 
    [3, 4, 5], 
    [6, 7, 8], 
    [9], 
    [10, 11, 12], 
    [13, 14, 15], 
    [16, 17, 18], 
    [19]
]

I would prefer to process the images indexed at 9 and 19 separately because they have different dimensions.
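The partitioning above can be sketched as a small plain-Python helper (the bucket boundaries and batch size are just the values from this example):

```python
def partition(bucket_indexes, batch_size):
    """Split each bucket's indexes into batches of at most batch_size."""
    batches = []
    for bucket in bucket_indexes:
        # ceil(len(bucket) / batch_size) batches per bucket
        for start in range(0, len(bucket), batch_size):
            batches.append(bucket[start:start + batch_size])
    return batches

buckets = [list(range(0, 10)), list(range(10, 20))]
print(partition(buckets, 3))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9],
#  [10, 11, 12], [13, 14, 15], [16, 17, 18], [19]]
```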

Looking through PyTorch's documentation, I found the BatchSampler class, which generates lists of mini-batch indexes. I made a custom Sampler class that emulates the partitioning of indexes described above. If it helps, here's my implementation:

import random
from collections import defaultdict

from torch.utils.data import Sampler


class CustomSampler(Sampler):

    def __init__(self, dataset, batch_size):
        self.batch_size = batch_size
        self.buckets = self._get_buckets(dataset)

    def __iter__(self):
        batch = []
        # Process buckets in random order
        dims = random.sample(list(self.buckets), len(self.buckets))
        for dim in dims:
            # Process images within each bucket in random order
            bucket = self.buckets[dim]
            bucket = random.sample(bucket, len(bucket))
            for idx in bucket:
                batch.append(idx)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
            # Yield the partially filled batch before moving to the next bucket
            if len(batch) > 0:
                yield batch
                batch = []

    def __len__(self):
        # Number of batches (not examples): ceil(len(bucket) / batch_size),
        # summed over all buckets
        return sum((len(b) + self.batch_size - 1) // self.batch_size
                   for b in self.buckets.values())

    def _get_buckets(self, dataset):
        # Group example indexes by image shape (C, W, H)
        buckets = defaultdict(list)
        for i in range(len(dataset)):
            img, _ = dataset[i]
            buckets[img.shape].append(i)
        return buckets

However, when I use my custom Sampler class, I get the following error:

Traceback (most recent call last):
    File "sampler.py", line 143, in <module>
        for i, batch in enumerate(dataloader):
    File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 263, in __next__
        indices = next(self.sample_iter)  # may raise StopIteration
    File "/home/roflcakzorz/anaconda3/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 139, in __iter__
        batch.append(int(idx))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

The DataLoader class seems to expect to be passed individual indexes, not lists of indexes.

Should I not be using a custom Sampler class for this task? I also considered making a custom collate_fn to pass to the DataLoader, but with that approach I don't believe I can control which indexes are allowed to be in the same mini-batch. Any guidance would be greatly appreciated.

Roflcakzorz

2 Answers

Do you have two networks, one for each of the sample sizes (a CNN's kernel size has to be fixed)? If yes, just pass the above custom sampler to the batch_sampler argument of the DataLoader class. That would fix the issue.
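For reference, here's a minimal, self-contained sketch of that call (the toy TensorDataset and the trivial ListBatchSampler are invented for illustration; your CustomSampler would take the sampler's place):

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class ListBatchSampler(Sampler):
    """Yields pre-built lists of indexes, one list per mini-batch."""
    def __init__(self, batches):
        self.batches = batches

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)

# Toy stand-in for the bucketed image dataset.
data = TensorDataset(torch.arange(20).float().unsqueeze(1), torch.arange(20))
batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

# The key point: use batch_sampler=, not sampler=. A sampler yields one
# index at a time; a batch_sampler yields a whole list of indexes per batch.
loader = DataLoader(data, batch_sampler=ListBatchSampler(batches))
print([len(y) for _, y in loader])  # [3, 3, 3, 1]
```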

Kris
  • I'm not sure I understand the question. I do use a CNN to process the images, but the images are variable in size and the output of the CNN does not need to be a fixed size. Unfortunately, I get errors when I use my `CustomSampler` as an argument to the DataLoader class. – Roflcakzorz Jun 09 '18 at 03:03
  • I think what you mean to say is that your batch size changes (e.g. (-1, 1, 28, 28)), but the images are all the same size. If not, could you show me your code? Also, you can look [here](https://discuss.pytorch.org/t/random-sampler-implementation/18934/6) for my implementation of RandomSampler; I think you can adapt it for your case. – Kris Jun 09 '18 at 09:15
  • Yes, so I guess there are two things that can change here, the batch size and the spatial dimensions of the images in the batch. However, for any given batch the images in that batch all have the same spatial dimension. – Roflcakzorz Jun 12 '18 at 18:19
  • It turns out that my error was in invoking my `CustomSampler` in the call to `DataLoader`. Embarrassingly, I didn't realize until the other day that `DataLoader` has separate keyword arguments for a sampler and a batch_sampler. Thank you for pointing this out to me. The `CustomSampler` class I implemented works as expected now. – Roflcakzorz Jun 12 '18 at 18:21
Hi, since every batch should contain images of the same dimension, your CustomSampler works just fine; it needs to be passed as an argument to torch.utils.data.DataLoader with the keyword batch_sampler. However, as stated in the docs, do remember this: batch_sampler is

"Mutually exclusive with batch_size, shuffle, sampler, and drop_last."
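A quick way to see that constraint in action (toy tensors; the plain list of index lists stands in for the custom sampler, since batch_sampler accepts any iterable of index lists):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.arange(6).float())
index_batches = [[0, 1], [2, 3], [4, 5]]  # stand-in for CustomSampler output

# Works: batch_sampler alone decides the batching.
loader = DataLoader(data, batch_sampler=index_batches)
print([len(x) for (x,) in loader])  # [2, 2, 2]

# Fails: shuffle conflicts with batch_sampler.
try:
    DataLoader(data, batch_sampler=index_batches, shuffle=True)
except ValueError as e:
    print("ValueError:", e)
```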

kithri