It is my understanding that Q-learning attempts to find the actual state-action values for all states and actions. However, my hypothetical example below seems to indicate that this is not necessarily the case.
Imagine a Markov decision process (MDP) with the following attributes:
- a state space S = {s_1} with only one possible state,
- an action space A = {a_1} with a single possible action,
- a reward function R: S × A × S → ℝ with R(s_1, a_1, s_1) = 4,
- and finally a state transition function T: S × A × S → [0,1], which produces probability 1 for all actions in A and all state transitions from and to state s_1.
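For concreteness, here is a minimal sketch of this MDP in Python; the names (`STATES`, `ACTIONS`, `reward`, `transition_prob`) are just my own illustration, not taken from any library:

```python
# Minimal encoding of the one-state, one-action MDP described above.
STATES = ["s1"]    # S = {s_1}
ACTIONS = ["a1"]   # A = {a_1}

def reward(s, a, s_next):
    # R(s_1, a_1, s_1) = 4 for the only possible triple
    return 4.0

def transition_prob(s, a, s_next):
    # T assigns probability 1 to the only possible transition s_1 -> s_1
    return 1.0
```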
Now assume that we have a single agent which has been initialized using optimistic initialization: for all possible states and actions we set the Q-value equal to 5 (i.e. Q(s_1, a_1) = 5). Q-values will be updated using the standard temporal-difference update rule (with only one available action, this coincides with the Q-learning update, which takes a maximum over next actions):
Q(S,A) := Q(S,A) + α( R + γQ(S',A') - Q(S,A) )
Here α and γ are chosen such that α ∈ (0,1] and γ ∈ (0,1]. Notice that we require both α and γ to be non-zero.
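For reference, a sketch of this update in the tabular setting could look like the following (variable names are mine; `alpha` and `gamma` are assumed to lie in (0,1] as stated above):

```python
# Tabular TD update for the single-state, single-action MDP above.
# Q is optimistically initialized to 5.
Q = {("s1", "a1"): 5.0}

def td_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """One application of Q(S,A) := Q(S,A) + alpha*(R + gamma*Q(S',A') - Q(S,A))."""
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return Q[(s, a)]
```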
When the agent selects its action (a_1) in state s_1, the update formula becomes:
Q(s_1, a_1) := 5 + α( 4 + γ·5 - 5 )
Notice that the Q-value does not change when γ·5 = 1 (i.e. when γ = 0.2), or more generally when γQ(S',A') = Q(S,A) - R. Also, the Q-value will increase when γQ(S',A') > Q(S,A) - R, which would further increase the difference between the actual state-action value and the estimated state-action value.
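To make the cases concrete, here is the term inside the parentheses evaluated for a few values of γ (just the arithmetic from the formula above, with Q(s_1, a_1) = 5 and R = 4):

```python
# Sign of the TD error (4 + gamma*5 - 5) determines whether the Q-value
# decreases, stays the same, or increases on this update.
for gamma in (0.1, 0.2, 0.5, 0.9):
    print(f"gamma = {gamma}: TD error = {4 + gamma * 5 - 5:+.2f}")
# gamma = 0.1 -> -0.50 (Q decreases), gamma = 0.2 -> +0.00 (no change),
# gamma = 0.5 -> +1.50 and gamma = 0.9 -> +3.50 (Q increases).
```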
This seems to indicate that in some cases it is possible for the difference between the actual and estimated state-action values to increase over time. In other words, it is possible for the estimated value to diverge from the actual value.
If we were to initialize the Q-values to 0 for all states and actions, we surely would not end up in this situation. However, I do believe it is possible for a stochastic reward/transition function to cause the agent to overestimate its state-action values in a similar fashion, triggering the behavior described above. This would require a rather improbable situation in which the MDP transitions to a high-payoff state often, even though this transition has a very low likelihood.
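In case it helps, this is the kind of experiment I have in mind; it simply repeats the update above, optionally with a noisy reward (the Gaussian noise model is purely my own assumption for the sake of experimentation):

```python
import random

def run_updates(n_steps, alpha, gamma, q_init=5.0, reward_noise=0.0):
    """Repeatedly apply the TD update in the one-state, one-action MDP.

    reward_noise > 0 adds zero-mean Gaussian noise to the reward of 4,
    as a stand-in for the stochastic reward function discussed above.
    """
    q = q_init
    for _ in range(n_steps):
        r = 4.0 + random.gauss(0.0, reward_noise)
        q += alpha * (r + gamma * q - q)  # here S' = S and A' = A, so Q(S',A') = q
    return q

# Deterministic reward, then a noisy reward, both with optimistic initialization.
print(run_updates(1000, alpha=0.1, gamma=0.5))
print(run_updates(1000, alpha=0.1, gamma=0.5, reward_noise=2.0))
```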
Perhaps some of the assumptions I made here do not actually hold. Maybe the goal is not to precisely estimate the true state-action values, but rather that convergence to the optimal state-action values is sufficient. That being said, I do find it rather odd that this divergence between actual and estimated returns appears to be possible.
Any thoughts on this would be appreciated.