
I am implementing DistributedDataParallel training for a simple CNN on torchvision.datasets.MNIST, running simultaneously on 3 distributed nodes. I want to partition the dataset into 3 non-overlapping subsets (A, B, C) of 20,000 images each. Each subset should be further split into training and testing partitions, i.e. 70% training and 30% testing. I plan to provide one subset to each distributed node so that the nodes can train and test in a DistributedDataParallel fashion.
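For concreteness, here is a minimal sketch of what I have in mind, assuming torch.utils.data.random_split with a fixed seed so that every node derives the same split (the MNIST training set has exactly 60,000 images, so three 20,000-image subsets split 70/30 give 14,000 training and 6,000 testing images each):

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))])
full_dataset = datasets.MNIST('data', train=True, download=True, transform=transform)

g = torch.Generator().manual_seed(0)  # assumed seed, so all nodes produce identical splits
# Three disjoint 20,000-image subsets (A, B, C)
subset_a, subset_b, subset_c = torch.utils.data.random_split(full_dataset, [20000, 20000, 20000], generator=g)
# 70/30 split of each subset: 14,000 train / 6,000 test
train_dataset_a, test_dataset_a = torch.utils.data.random_split(subset_a, [14000, 6000], generator=g)
train_dataset_b, test_dataset_b = torch.utils.data.random_split(subset_b, [14000, 6000], generator=g)
train_dataset_c, test_dataset_c = torch.utils.data.random_split(subset_c, [14000, 6000], generator=g)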

The basic code shown below downloads the MNIST dataset via torchvision.datasets.MNIST and then uses torch.utils.data.distributed.DistributedSampler and torch.utils.data.DataLoader to create data batches for training and testing on a single node.


import torch
import torch.distributed as dist
from torchvision import datasets, transforms

# TRAINING DATA

train_dataset = datasets.MNIST('data', train=True, download=True, transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]))
# DistributedSampler shards the training data across the 3 ranks
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=3, rank=dist.get_rank())
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=False, num_workers=3, pin_memory=True, sampler=train_sampler)


# TESTING DATA

test_dataset = datasets.MNIST('data', train=False, download=False, transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]))
# No sampler here, so every node currently evaluates on the full 10,000-image test set
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=3, pin_memory=True)

I expect the answer to create train_dataset_a, train_dataset_b, and train_dataset_c, as well as test_dataset_a, test_dataset_b, and test_dataset_c.
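Given those six datasets, each node could then pick its own partition by rank; a rough sketch, assuming the process group is already initialised (note that because each node then owns a disjoint subset, the DistributedSampler is no longer needed):

import torch
import torch.distributed as dist

train_parts = [train_dataset_a, train_dataset_b, train_dataset_c]
test_parts = [test_dataset_a, test_dataset_b, test_dataset_c]

rank = dist.get_rank()  # 0, 1, or 2
train_loader = torch.utils.data.DataLoader(train_parts[rank], batch_size=256, shuffle=True, num_workers=3, pin_memory=True)
test_loader = torch.utils.data.DataLoader(test_parts[rank], batch_size=256, shuffle=False, num_workers=3, pin_memory=True)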

Anjum
Comment: We built BetterLoader (https://binitai.github.io/BetterLoader/) to do stuff just like this! The project's still really new, but you may find it helpful if you're still working on similar problems :) – Raghav Sep 15 '20 at 22:02
