
I recently started learning deep learning with PyTorch using this tutorial.

I am having a problem with these lines of code.

The parameter train=True means it will load the training data.

But how much data does it take for training? 50%?

How can we specify the amount of data used for training? Similarly, I couldn't understand batch_size and num_workers: what do they mean when loading the data? Is the batch_size parameter similar to the one generally used in deep learning for training?

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)
Aadnan Farooq A

2 Answers


If you don't split your data beforehand, the trainloader will use the entire training set. You can control the amount of data used for training by splitting it yourself, for example:

import numpy as np
import torch
from torchvision import datasets, transforms
from torch.utils.data.sampler import SubsetRandomSampler

# how many samples per batch to load
batch_size = 4
# number of subprocesses to use for data loading
num_workers = 2

# convert data to a normalized torch.FloatTensor
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
    ])

# choose the training and test datasets
train_data = datasets.CIFAR10('data', train=True,
                              download=True, transform=transform)
test_data = datasets.CIFAR10('data', train=False,
                             download=True, transform=transform)
valid_size = 0.2

# obtain training indices that will be used for validation
num_train = len(train_data)
indices = list(range(num_train))
np.random.shuffle(indices)
split = int(np.floor(valid_size * num_train))
train_idx, valid_idx = indices[split:], indices[:split]

# define samplers for obtaining training and validation batches
train_sampler = SubsetRandomSampler(train_idx)
valid_sampler = SubsetRandomSampler(valid_idx)

# prepare data loaders (combine dataset and sampler)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
    sampler=train_sampler, num_workers=num_workers)
valid_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, 
    sampler=valid_sampler, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, 
    num_workers=num_workers)

The batch size is the number of samples the loader hands to the model per iteration, not per epoch. For example, if your training set has 1000 samples and your batch_size is 10, then each epoch consists of 100 iterations.
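As a minimal sketch of that arithmetic (the dummy tensors below are made up purely for illustration):

import torch
from torch.utils.data import TensorDataset, DataLoader

# dummy dataset of 1000 fake "images", only to illustrate the batch arithmetic
dummy_data = TensorDataset(torch.randn(1000, 3, 32, 32),
                           torch.zeros(1000, dtype=torch.long))
loader = DataLoader(dummy_data, batch_size=10, shuffle=True)

print(len(dummy_data))   # 1000 samples in the dataset
print(len(loader))       # 100 batches, i.e. 100 iterations per epoch

images, labels = next(iter(loader))
print(images.shape)      # torch.Size([10, 3, 32, 32]) -- one batch of 10 samples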

num_workers controls how many subprocesses are used to load and preprocess the batches. More workers consume more memory, but they help speed up the input/output work. num_workers=0 means the data is loaded in the main process when it is needed, while num_workers > 0 means the batches are prepared in the background by the number of worker processes you defined.
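A rough way to see the effect on your own machine is to time one pass over the loader with different worker counts. This is only a sketch using the train_data defined above; the numbers depend entirely on your hardware and transforms, and on Windows/macOS anything with num_workers > 0 should be run under an if __name__ == '__main__': guard:

import time

# time one full pass over the training data with different worker counts
for workers in (0, 2):
    loader = torch.utils.data.DataLoader(train_data, batch_size=4, num_workers=workers)
    start = time.time()
    for images, labels in loader:
        pass  # only iterating, no training
    print(f"num_workers={workers}: {time.time() - start:.1f} s per pass")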

Isac Moura
  • In the above code, will it split the data as 60-20-20 for training, validation and testing? Also, is the batch size the same as the one we generally use in deep learning? – Aadnan Farooq A Jan 28 '19 at 03:16
  • Secondly, how can I define the initial variables, like `train_data`? I am using the above tutorial and want to fit your code into it. – Aadnan Farooq A Jan 28 '19 at 03:30
  • For your first question: the code splits `train_data` into 80% for training and 20% for validation (validation is done during the training process). The `test_data` (a dataset that the neural network has never seen before) is then used after training to test its accuracy. Yes, it is the same batch size; see the quick size check after these comments. – Isac Moura Jan 28 '19 at 04:37
  • About your second question: `train_data` is your dataset. In this code I'm specifically using the CIFAR10 dataset already provided by `torchvision`. I'll edit my answer to include this part of the code. – Isac Moura Jan 28 '19 at 04:39
  • The structure is the same, but you have to specify the path of your dataset, the batch_size, the number of workers, whether you want to shuffle the images, and so on. – Isac Moura Jan 28 '19 at 04:45
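For reference, a quick check of the split sizes produced by the code above, assuming the standard CIFAR-10 sizes (50000 training and 10000 test images):

import numpy as np

# CIFAR-10: 50000 training images, 10000 test images
num_train = 50000
split = int(np.floor(0.2 * num_train))   # valid_size = 0.2
print(num_train - split, split, 10000)   # 40000 10000 10000

# So the overall split is 40000 train / 10000 validation / 10000 test,
# i.e. roughly 67/17/17 of all 60000 images rather than 60-20-20.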

batch_size is the size of the batches (groups of samples from the dataset you provided) that you want, and num_workers is the number of workers that prepare those batches, basically multiprocessing workers.

But how much data does it take for training? 50%?

DataLoader doesn't provide you with any way to control the number of samples you wish to extract. You will have to use the typical ways of slicing iterators.

The simplest thing to do (without any libraries) would be to stop once the required number of samples has been reached.

nsamples = 10000
for i, (image, label) in enumerate(train_loader):
    # i counts batches, so stop once i * batch_size reaches the sample budget
    if i * train_loader.batch_size >= nsamples:
        break
    # your training code here

Or, you could use itertools.islice to take only the first part of the loader. Note that islice slices batches, not individual samples, so with batch_size=4 you would take 2500 batches to get roughly 10k samples:

import itertools

for image, label in itertools.islice(train_loader, 2500):
    # your training code here
    pass

You can refer to this answer.

Himanshu Bansal