While training on a TPU, I tried passing experimental_steps_per_execution to model.compile(...). I do see a big speedup, but with the exact same learning rate schedule I get a 2-3% drop in accuracy at the end of training. In summary, the only thing I changed is that parameter.
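For reference, here is a minimal sketch of my setup (the model, optimizer, and the value 64 are placeholders, not my actual configuration; this assumes TF 2.3, where the argument still had the experimental_ prefix):

```python
import tensorflow as tf

# Standard TPU initialization.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Placeholder model, not my real architecture.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
        # The only change between runs: with this set (e.g. to 64),
        # training is much faster but final accuracy drops by 2-3%.
        experimental_steps_per_execution=64,
    )
```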
I have not found any detailed documentation on this parameter. While it clearly speeds up training, I am unclear about the algorithmic difference, especially how the gradients are computed and how the gradient descent steps are applied.
Does anyone know more about this? Do I need to tune other things such as my learning rate or batch_size?