4

I'm trying to compare the convergence rates of the SGD and GD algorithms for neural networks. In PyTorch, we often use the SGD optimizer as follows.

# model, loss, train_dataset and epochs are assumed to be defined elsewhere.
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(epochs):
    running_loss = 0

    for input_batch, labels_batch in train_dataloader:
        # Forward pass on the current batch of 64 samples
        y_hat = model(input_batch)
        y = labels_batch
        L = loss(y_hat, y)

        # Backward pass and one parameter update per batch
        optimizer.zero_grad()
        L.backward()
        optimizer.step()

        running_loss += L.item()

My understanding of the optimizer here is that the SGD optimizer actually performs the Mini-batch Gradient Descent algorithm, because we feed the optimizer one batch of data at a time. So, if we set the batch_size parameter to the size of the whole training set, the code actually performs (full-batch) Gradient Descent on the neural network.
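
For example, I think (full-batch) Gradient Descent would then be the same loop with a different DataLoader. A rough sketch of what I mean (assuming the same model, loss, train_dataset and epochs as above):

full_batch_loader = torch.utils.data.DataLoader(train_dataset, batch_size=len(train_dataset))
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

for epoch in range(epochs):
    # The loader now yields a single batch containing the entire training set,
    # so optimizer.step() uses the gradient of the loss over all samples.
    for input_batch, labels_batch in full_batch_loader:
        y_hat = model(input_batch)
        L = loss(y_hat, labels_batch)

        optimizer.zero_grad()
        L.backward()
        optimizer.step()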

Is my understanding correct?

mathgeek
  • Can you please clarify the question? SGD (or any other optimizer) does the same thing whether you use it with a batch size of 1 or all of your data. If the batch size is more than 1, the loss is the average over the batch (e.g. MSE). A lot of researchers prefer a batch size of 32 or 64. Batch size also depends on how much you can send to the video card. – Bhupen Jun 05 '22 at 05:21
  • @Bhupen: Hi, Bhupen. Yes, in SGD the loss is the average over the batch (e.g. MSE). But in PyTorch we control the input size; for example, the size of **input_batch** is 64 in my code. My question is: when we run `optimizer.step()`, does the optimizer compute the total loss over these 64 inputs and then perform one Gradient Descent step, or does it compute the loss of 1 input, perform Gradient Descent, and loop 64 times? – mathgeek Jun 06 '22 at 00:41
  • 1
    Ah, I see. GD (or any other optimization) is done on the batch, not on individual samples. – Bhupen Jun 06 '22 at 13:51
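
To make the point from the comments concrete, here is a minimal, self-contained sketch (hypothetical model and data; the loss uses PyTorch's default reduction='mean'): one call to L.backward() produces a single gradient averaged over the whole batch, and optimizer.step() performs one update per batch, not one per sample.

import torch

# Hypothetical toy setup: 64 random samples and a mean-reduction MSE loss.
model = torch.nn.Linear(10, 1)
loss = torch.nn.MSELoss()             # reduction='mean' by default
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

x = torch.randn(64, 10)               # one batch of 64 inputs
y = torch.randn(64, 1)

optimizer.zero_grad()
L = loss(model(x), y)                 # scalar: average loss over the 64 samples
L.backward()                          # one backward pass -> one gradient for the whole batch
optimizer.step()                      # one parameter update for the whole batch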

1 Answer

3

Your understanding is correct. SGD just updates the weights based on the gradient computed by backpropagation; the flavor of gradient descent it performs is therefore determined by the data loader (see the sketch after the list below).

  • Gradient descent (aka batch gradient descent): Batch size equal to the size of the entire training dataset.
  • Stochastic gradient descent: Batch size equal to one and shuffle=True.
  • Mini-batch gradient descent: Any other batch size and shuffle=True. By far the most common in practical applications.
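
For instance, a minimal sketch of all three configurations (assuming the same train_dataset as in the question):

# Gradient descent (batch gradient descent): the entire training set in one batch.
gd_loader = torch.utils.data.DataLoader(train_dataset, batch_size=len(train_dataset))

# Stochastic gradient descent: one sample per update, reshuffled every epoch.
sgd_loader = torch.utils.data.DataLoader(train_dataset, batch_size=1, shuffle=True)

# Mini-batch gradient descent: anything in between, e.g. 64 samples per update.
minibatch_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)

The training loop itself stays the same in every case; the loader alone determines how many samples each optimizer.step() sees.
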
jodag