It is my understanding that Q-learning attempts to find the actual state-action values for all states and actions. However, my hypothetical example below seems to indicate that this is not necessarily the case.
Imagine a Markov decision process (MDP) with the following attributes:
- a state space S = {s_1} with only one possible state,
- an action space A = {a_1} with a single possible action,
- a reward function R: S × A × S → ℝ with R(s_1, a_1, s_1) = 4,
- and finally a state transition function T: S × A × S → [0,1], which produces probability 1 for all actions in A and all state transitions from and to state s_1.
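For concreteness, here is a minimal sketch of this MDP in Python; the names (`STATES`, `ACTIONS`, `reward`, `transition_prob`) are just my own illustration, not taken from any library:

```python
# Minimal encoding of the one-state, one-action MDP described above.
STATES = ["s1"]    # S = {s_1}
ACTIONS = ["a1"]   # A = {a_1}

def reward(s, a, s_next):
    # R(s_1, a_1, s_1) = 4 for the only possible triple
    return 4.0

def transition_prob(s, a, s_next):
    # T assigns probability 1 to the only possible transition s_1 -> s_1
    return 1.0
```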
Now assume that we have a single agent which has been initialized using optimistic initialization: for all possible states and actions we set the Q-value equal to 5 (i.e. Q(s_1, a_1) = 5). Q-values will be updated using the standard temporal-difference update rule (with only one available action, this coincides with the Q-learning update, which takes a maximum over next actions):
Q(S,A) := Q(S,A) + α( R + γQ(S',A') - Q(S,A) )
Here α and γ are chosen such that α ∈ (0,1] and γ ∈ (0,1]. Notice that we require both α and γ to be non-zero.
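For reference, a sketch of this update in the tabular setting could look like the following (variable names are mine; `alpha` and `gamma` are assumed to lie in (0,1] as stated above):

```python
# Tabular TD update for the single-state, single-action MDP above.
# Q is optimistically initialized to 5.
Q = {("s1", "a1"): 5.0}

def td_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One application of Q(S,A) := Q(S,A) + alpha*(R + gamma*Q(S',A') - Q(S,A))."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]
```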
When the agent selects its action (a_1) in state s_1, the update formula becomes:
Q(s_1, a_1) := 5 + α( 4 + γ·5 - 5 )
Notice that the Q-value does not change when γ·5 = 1 (i.e. when γ = 0.2), or more generally when γQ(S',A') = Q(S,A) - R. Also, the Q-value will increase when γQ(S',A') > Q(S,A) - R, which would further increase the difference between the actual state-action value and the estimated state-action value.
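To make the cases concrete, here is the term inside the parentheses evaluated for a few values of γ (just the arithmetic from the formula above, with Q(s_1, a_1) = 5 and R = 4):

```python
# Sign of the TD error (4 + gamma*5 - 5) determines whether the Q-value
# decreases, stays the same, or increases on this update.
for gamma in (0.1, 0.2, 0.5, 0.9):
    print(f"gamma = {gamma}: TD error = {4 + gamma * 5 - 5:+.2f}")
# gamma = 0.1 -> -0.50 (Q decreases), gamma = 0.2 -> +0.00 (no change),
# gamma = 0.5 -> +1.50 and gamma = 0.9 -> +3.50 (Q increases).
```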
This seems to indicate that in some cases it is possible for the difference between the actual and estimated state-action values to increase over time. In other words, it is possible for the estimated value to diverge from the actual value.
If we were to initialize the Q-values to 0 for all states and actions, we surely would not end up in this situation. However, I do believe it is possible for a stochastic reward/transition function to cause the agent to overestimate its state-action values in a similar fashion, triggering the behavior described above. This would require a rather improbable situation in which the MDP transitions to a high-payoff state often, even though this transition has a very low likelihood.
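In case it helps, this is the kind of experiment I have in mind; it simply repeats the update above, optionally with a noisy reward (the Gaussian noise model is purely my own assumption for the sake of experimentation):

```python
import random

def run_updates(n_steps, alpha, gamma, q_init=5.0, reward_noise=0.0):
    """Repeatedly apply the TD update in the one-state, one-action MDP.

    reward_noise > 0 adds zero-mean Gaussian noise to the reward of 4,
    as a stand-in for the stochastic reward function discussed above.
    """
    q = q_init
    for _ in range(n_steps):
        r = 4.0 + random.gauss(0.0, reward_noise)
        q += alpha * (r + gamma * q - q)  # here S' = S and A' = A, so Q(S',A') = q
    return q

# Deterministic reward, then a noisy reward, both with optimistic initialization.
print(run_updates(1000, alpha=0.1, gamma=0.5))
print(run_updates(1000, alpha=0.1, gamma=0.5, reward_noise=2.0))
```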
Perhaps some of the assumptions I made here do not actually hold. Maybe the goal is not to precisely estimate the true state-action values, but rather that convergence to the optimal state-action values is sufficient. That being said, I do find it rather odd that this divergence between actual and estimated returns appears to be possible.
Any thoughts on this would be appreciated.