I am implementing DistributedDataParallel training of a simple CNN on torchvision.datasets.MNIST, running simultaneously on 3 distributed nodes. I want to partition the dataset into 3 non-overlapping subsets (A, B, C) of 20000 images each. Each subset should be further split into training and testing partitions, i.e. 70% training and 30% testing. I plan to provide each subset to a different distributed node so that they can train and test in a DistributedDataParallel fashion.
The basic code shown below downloads the MNIST dataset from torchvision.datasets.MNIST and then uses torch.utils.data.distributed.DistributedSampler and torch.utils.data.DataLoader to create data batches for training and testing on a single node.
# TRAINING DATA
train_dataset = datasets.MNIST('data', train=True, download=True, transform=transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.1307,), (0.3081,))]))
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=3, rank=dist.get_rank())
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=False, num_workers=3, pin_memory=True, sampler=train_sampler)
# TESTING DATA
test_dataset = datasets.MNIST('data', train=False, download=False, transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]))
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=256, shuffle=False, num_workers=3, pin_memory=True)
I expect the answer to create train_dataset_a, train_dataset_b, and train_dataset_c, as well as test_dataset_a, test_dataset_b, and test_dataset_c.