I am experimenting with the Q-learning algorithm. I have read several sources and understand the algorithm itself, but there seems to be no clear, mathematically backed convergence criterion.
Most sources recommend iterating a fixed number of times (for example, N = 1000), while others say convergence is achieved when all state-action pairs (s, a) are visited infinitely often. But how many visits count as "infinitely often" in practice? What is the best criterion for someone who wants to work through the algorithm by hand?
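To make the question concrete, here is a minimal sketch of the kind of stopping rule I have in mind: stop once the largest change in any Q(s, a) over a full sweep drops below a tolerance. Everything in it is an assumption for illustration: the tiny 3-state chain, the names `step` and `q_learning_until_stable`, the values of `ALPHA`, `GAMMA` and `TOL`, and the fact that it sweeps over every (s, a) pair in order instead of following an exploration policy (chosen so the numbers can be reproduced by hand).

```python
# Hypothetical toy problem: states 0, 1, 2 in a chain; state 2 is terminal.
# Action 0 moves left (state 0 stays at 0), action 1 moves right.
# Reaching state 2 gives reward 1, every other move gives reward 0.
GAMMA = 0.9          # discount factor (assumed)
ALPHA = 0.5          # constant learning rate (assumed)
TOL = 1e-6           # stopping tolerance on the largest Q change (assumed)
MAX_SWEEPS = 10_000  # safety cap so the loop always terminates
N_STATES = 3
TERMINAL = 2
ACTIONS = (0, 1)


def step(state, action):
    """Deterministic transition: return (next_state, reward)."""
    next_state = state + 1 if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward


def q_learning_until_stable():
    """Apply the Q-learning update to every (s, a) pair, sweep after sweep,
    until the largest change within one sweep is below TOL."""
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for sweep in range(1, MAX_SWEEPS + 1):
        max_change = 0.0
        for s in range(N_STATES):
            if s == TERMINAL:
                continue  # no actions are taken from the terminal state
            for a in ACTIONS:
                s_next, r = step(s, a)
                # Q-learning target: r + gamma * max_a' Q(s', a')
                target = r + GAMMA * max(q[s_next])
                new_value = q[s][a] + ALPHA * (target - q[s][a])
                max_change = max(max_change, abs(new_value - q[s][a]))
                q[s][a] = new_value
        if max_change < TOL:
            return q, sweep
    return q, MAX_SWEEPS


if __name__ == "__main__":
    q_table, sweeps_used = q_learning_until_stable()
    print("sweeps used:", sweeps_used)
    for s in range(N_STATES):
        print(s, [round(v, 4) for v in q_table[s]])
```

For this toy chain the optimal values can be worked out by hand (Q*(1, right) = 1, Q*(0, right) = 0.9, and 0.81 for both left moves), so the result can be checked. What I do not know is whether a small maximum change is a sound stopping rule in general, or only a heuristic.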
I would be grateful if someone could explain this to me. I would also appreciate pointers to any articles on the topic.
Regards.