
When implementing mini-batch gradient descent, is it better to choose the training examples used to compute the derivatives at random? Or would it be better to shuffle the whole training set, iterate through it in order, and reshuffle every epoch? The first method might make us jump over the global minimum.
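To make the two options concrete, here is a minimal sketch of what I mean (NumPy; X, y, and the batch parameters are placeholders):

    import numpy as np

    # Method 1: draw every mini-batch independently at random.
    def random_batches(X, y, batch_size, n_steps, rng):
        for _ in range(n_steps):
            idx = rng.choice(len(X), size=batch_size, replace=False)
            yield X[idx], y[idx]

    # Method 2: reshuffle once per epoch, then sweep through all examples.
    def shuffled_epochs(X, y, batch_size, n_epochs, rng):
        for _ in range(n_epochs):
            order = rng.permutation(len(X))
            for start in range(0, len(X), batch_size):
                idx = order[start:start + batch_size]
                yield X[idx], y[idx]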

WestCoastProjects

1 Answer


Sorting the input data would mean the model is trained on a non-representative sequence of inputs: you would have changed the distribution each batch sees, likely quite considerably.

When you use the more standard approach of a randomly selected (and hopefully representative) batch from the overall dataset, jumping over the global minimum is still a possibility. There are many approaches to reducing this chance; you might look at gradually shrinking the step size as training progresses, in the spirit of simulated annealing.
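As a rough illustration of such a graduated adjustment (not the only option), here is a sketch of mini-batch SGD with a decaying step size; the grad_fn callback, the hyperparameters, and the 1 / (1 + decay * t) schedule are illustrative assumptions:

    import numpy as np

    def sgd_with_decay(X, y, grad_fn, w0, batch_size=32, n_epochs=100,
                       lr0=0.1, decay=1e-3, seed=0):
        # grad_fn(w, Xb, yb) is assumed to return the mini-batch gradient.
        rng = np.random.default_rng(seed)
        w, t = w0.astype(float), 0
        for _ in range(n_epochs):
            order = rng.permutation(len(X))        # reshuffle every epoch
            for start in range(0, len(X), batch_size):
                idx = order[start:start + batch_size]
                lr = lr0 / (1.0 + decay * t)       # step size shrinks over time
                w = w - lr * grad_fn(w, X[idx], y[idx])
                t += 1
        return w

Larger steps early on help explore; the shrinking steps later make it harder to jump over a minimum once you are near it.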

WestCoastProjects