When implementing mini-batch gradient descent, is it better to choose the training examples used to compute the derivatives at random? Or would it be better to shuffle the whole training set, iterate through it, and reshuffle every epoch? The first method may lead us to jump over the global minimum.
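For concreteness, here is a minimal NumPy sketch of the two sampling schemes I am contrasting; the arrays `X`, `y` and the batch size are just placeholders.

```python
import numpy as np

# Placeholder data: X (features) and y (targets).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.normal(size=1000)
batch_size = 32

# Scheme 1: draw each mini-batch independently at random, so within an
# "epoch" some examples may be repeated and others never seen.
def random_batches(X, y, batch_size, n_batches):
    for _ in range(n_batches):
        idx = rng.integers(0, len(X), size=batch_size)
        yield X[idx], y[idx]

# Scheme 2: shuffle the whole training set once per epoch, then iterate
# through it in order, so every example is used exactly once per epoch.
def shuffled_epoch_batches(X, y, batch_size):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        yield X[idx], y[idx]
```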
Sorting the input data would mean that the model is trained on a non-representative set of inputs. You have changed the distribution, likely quite considerably.
When you use the more standard approach of randomly selected (and hopefully representative) batches from the overall dataset, jumping over the global minimum is still a possibility. There are many approaches that help reduce this chance; you might look at graduated adjustments that shrink the step size over time, in the spirit of simulated annealing.
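As a rough illustration of such a graduated adjustment, here is a sketch of mini-batch SGD on a linear least-squares model with a decaying step size; the model, the decay schedule, and all hyperparameters are illustrative assumptions, not a prescription.

```python
import numpy as np

# Illustrative data: a linear target with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(10)
initial_lr, decay, batch_size = 0.1, 1e-3, 32
step = 0
for epoch in range(20):
    perm = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        # Gradient of the mean squared error on this mini-batch.
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        # Step size shrinks as training progresses, so late updates are
        # less likely to overshoot the minimum.
        lr = initial_lr / (1.0 + decay * step)
        w -= lr * grad
        step += 1
```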

WestCoastProjects