As far as I understand Q-learning, a Q-value is a measure of "how good" a particular state-action pair is. It is usually represented as a table in one of the following ways (see fig.):
- Are both representations valid?
- How do you determine the best action if the Q-table is given as a state-to-state transition table (as shown in the top Q-table in the figure), especially if the state transitions are non-deterministic (i.e., taking an action from a given state can land you in different states at different times)?
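For concreteness, here is how I currently picture the standard state-action form and greedy action selection (a minimal sketch with made-up values, not real training output):

```python
import numpy as np

# Hypothetical Q-table in the state-action form:
# rows = states, columns = actions.
q_table = np.array([
    [0.1, 0.5],   # state 0: Q(s0, a0), Q(s0, a1)
    [0.3, 0.2],   # state 1: Q(s1, a0), Q(s1, a1)
    [0.0, 0.4],   # state 2: Q(s2, a0), Q(s2, a1)
])

def best_action(state: int) -> int:
    """Greedy policy: pick the action (column) with the highest Q-value."""
    return int(np.argmax(q_table[state]))

print(best_action(0))  # -> 1, since Q(s0, a1) = 0.5 > Q(s0, a0) = 0.1
```

In this form the best action for a state is just an argmax over that state's row, so I understand that case; my confusion is specifically about the state-to-state form, where a single column (next state) may be reachable via different actions.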