The optimal action-value function given by the Bellman optimality equation (page 63 of Sutton & Barto, 2018) is

$$q_*(s,a) = \mathbb{E}\!\left[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1},a') \,\middle|\, S_t = s,\, A_t = a\right] = \sum_{s',r} p(s',r \mid s,a)\left[r + \gamma \max_{a'} q_*(s',a')\right],$$
and the Q-learning update rule is

$$Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\left[R_{t+1} + \gamma \max_{a} Q(S_{t+1},a) - Q(S_t,A_t)\right].$$
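To make my understanding concrete, here is a minimal sketch of tabular Q-learning; the `env` object with a Gym-style `reset()`/`step()` interface and the `q_learning` helper are just assumptions for illustration, not from the book:

```python
import numpy as np

# Minimal tabular Q-learning sketch. The Gym-style env interface
# (env.reset() -> state, env.step(a) -> (next_state, reward, done))
# is an assumption for illustration.
def q_learning(env, n_states, n_actions,
               episodes=1000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)  # one *sampled* transition
            # update uses only the sample (s, a, r, s');
            # p(s', r | s, a) never appears
            target = r + (0.0 if done else gamma * Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

As far as I can tell, the update only ever touches one sampled transition at a time.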
I know that Q-learning is model-free, so it does not need the transition probabilities for the next state.
However, $p(s', r \mid s, a)$ in the Bellman equation is the probability of transitioning to the next state $s'$ with reward $r$ given $s$ and $a$, so it seems that computing $Q(s,a)$ does require the transition probabilities.
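For example, evaluating the right-hand side of the Bellman equation directly would need the full model; a sketch of what I mean, assuming a hypothetical `model` dict mapping `(s, a)` to `(prob, next_state, reward)` tuples and a `Q` table as above:

```python
def bellman_backup(Q, model, s, a, gamma=0.99):
    # model[(s, a)] is assumed to be a list of (prob, next_state, reward)
    # tuples, i.e. exactly the distribution p(s', r | s, a) that a
    # model-free method does not have access to.
    return sum(prob * (reward + gamma * Q[s_next].max())
               for prob, s_next, reward in model[(s, a)])
```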
Is the $Q$ in the Bellman equation different from the $Q$ in Q-learning?
If they are the same, how can Q-learning work model-free?
Is there a way for Q-learning to obtain $Q(s,a)$ without the transition probabilities?
Or am I confusing something?