DataLoader messing up transformed data

Question

I am testing the MNIST dataset in Pytorch, and after I apply a transformation to the X data, it seems the DataLoader puts all values out of the original order, potentially messing up the training step.

My transformation is to divide all values by 255. One should notice that the transformation itself does not change positions, as shown by the first scatterplots. But after the data is passed to the DataLoader and I retrieve it back, they are out of order. If I make no transformation, everything is fine (not shown). The distribution of the values is the same among before, after1 (divided by 255/before DataLoader) and after2 (divided by 255/after DataLoader) (also not shown), only the order seems to be affected.

import torch
from torchvision import datasets
import torchvision.transforms as transforms
import matplotlib.pyplot as plt

transform = transforms.ToTensor()

train = datasets.MNIST(root = '.', train = True, download = True, transform = transform)
test = datasets.MNIST(root = '.', train = False, download = True, transform = transform)

before = train.data[0]

train.data = train.data.float()/255
after1 = train.data[0]

train_loader = torch.utils.data.DataLoader(train, batch_size = 128)
test_loader = torch.utils.data.DataLoader(test, batch_size = 128)

fig, ax = plt.subplots(1, 2)
ax[0].scatter(range(len(before.view(-1))), before.view(-1))
ax[0].set_title('Before')
ax[1].scatter(range(len(after1.view(-1))), after1.view(-1))
ax[1].set_title('After1')

after2 = next(iter(train_loader))[0][0]

fig, ax = plt.subplots(1, 2)
ax[0].scatter(range(len(before.view(-1))), before.view(-1))
ax[0].set_title('Before')
ax[1].scatter(range(len(after2.view(-1))), after2.view(-1))
ax[1].set_title('After2')

fig, ax = plt.subplots(1, 3)
ax[0].imshow(before, cmap = 'gray')
ax[1].imshow(after1, cmap = 'gray')
ax[2].imshow(after2.view(28, 28), cmap = 'gray')

I know that this might not be the best way to deal with this data (transforms.Normalize should solve it), but I would really like to understand what is happening.

Thank you!

score 1 · Answer 1 · answered Sep 14 '19 at 20:52

So... I posted this same question at Pytorch's GitHub page, and they answered the following:

It's unrelated to data loader. You are messing with an attribute of the particular dataset object, however, the actual __getitem__ of that object does much more: https://github.com/pytorch/vision/blob/6de158c473b83cf43344a0651d7c01128c7850e6/torchvision/datasets/mnist.py#L92

In particular this line (mode='L') assumes uint8 input. Since you replaced it with float, it is wrong.

Then I guess the preferred approach would be to apply a transform when preparing the dataset at the very beginning of my code.

This dataset `MNIST(VisionDataset):` is a lot less awesome than you may expect. I [have similar answer](https://stackoverflow.com/a/56672709/5884955) you may find helpful. — prosti, Sep 18 '19 at 12:13

prosti · Answer 2 · 2019-09-14T04:52:00.860

Originally I haven't tested the code you wrote. Rewrote the original:

import torch
from torchvision import datasets
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, Dataset, TensorDataset
import matplotlib.pyplot as plt

transform = transforms.ToTensor()

train = datasets.MNIST(root = '.', train = True, download = True, transform = transform)
test = datasets.MNIST(root = '.', train = False, download = True, transform = transform)

dl = DataLoader(train)

images = dl.dataset.data.float()/255
labels = dl.dataset.targets

train_ds = TensorDataset(images, labels)
train_loader = DataLoader(train_ds, batch_size=128)
# img, target = next(iter(train_loader))

before = train.data[0]
train.data = train.data.float()/255
after1 = train.data[0]

# train_loader = torch.utils.data.DataLoader(train, batch_size = 128)
test_loader = torch.utils.data.DataLoader(test, batch_size = 128)

fig, ax = plt.subplots(1, 2)
ax[0].scatter(range(len(before.view(-1))), before.view(-1))
ax[0].set_title('Before')
ax[1].scatter(range(len(after1.view(-1))), after1.view(-1))
ax[1].set_title('After1')

after2 = next(iter(train_loader))[0][0]

fig, ax = plt.subplots(1, 2)
ax[0].scatter(range(len(before.view(-1))), before.view(-1))
ax[0].set_title('Before')
ax[1].scatter(range(len(after2.view(-1))), after2.view(-1))
ax[1].set_title('After2')

fig, ax = plt.subplots(1, 3)
ax[0].imshow(before, cmap = 'gray')
ax[1].imshow(after1, cmap = 'gray')
ax[2].imshow(after2.view(28, 28), cmap = 'gray')

It didn't work. The `shuffle` argument is only related to whole instances, not within-instance data structure. As a fact, it is this internal structure that makes any prediction possible, shuffling it would make no sense. By the way, leaving `shuffle = True` does not affect data structure when I do not apply that transformation. (Also, `shuffle = False` by default.) — Denny Ceccon, Sep 13 '19 at 23:25
D, I couldn't figure out why original code haven't worked, if you figure out, let me know. — prosti, Sep 14 '19 at 04:53

DataLoader messing up transformed data

2 Answers2