I am trying to implement Q-learning. The general algorithm, taken from here, is as follows:

[Image: Q-learning algorithm pseudocode]

In the statement

[Image: the Q-value update statement from the pseudocode]

What I don't understand is whether I should implement this statement from the original pseudocode recursively, expanding every next state that the current state/action can lead to and taking the maximum each time,

or simply take the maximum Q-value over the actions available in the next state, read directly from the state-action Q-value table?

Thanks in advance.

dariush

1 Answer

All the formula says is that at step t+1 you update the state-action value using the state-action value from step t and the maximum of the values over all actions for the current state. There is no recursion: the maximum is a single lookup over the entries already stored in the Q-table.
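To make the lookup concrete, here is a minimal sketch of one tabular update in Python. It assumes the standard form of the rule, Q(s, a) += alpha * (r + gamma * max over a' of Q(s', a') - Q(s, a)). The function name `q_update`, the dictionary-based table, the state and action labels, and the values of `alpha` and `gamma` are all made up for illustration and are not taken from the question's pseudocode.

```python
from collections import defaultdict


def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update and return the new Q[(state, action)].

    Q is a mapping from (state, action) pairs to values (hypothetical layout).
    alpha (learning rate) and gamma (discount factor) are illustrative defaults.
    """
    # Plain table lookup over the actions of the state just reached.
    # No recursion and no expansion of further successor states.
    best_next = max(Q[(next_state, a)] for a in actions)

    # Move the old estimate a fraction alpha toward reward + discounted best value.
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


# Tiny usage example with made-up states and actions.
Q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
actions = ["left", "right"]
Q[("s1", "left")] = 1.0
Q[("s1", "right")] = 2.0

new_value = q_update(Q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(new_value)
```

The values further down the trajectory are never expanded explicitly. They enter only through what is already stored in the table for the reached state, and repeated updates over many steps are what propagate them backwards.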

Don Reba
  • Note that you are actually maximizing over actions applied to the state s_(t+1), so I would probably describe this as maximum of values over all the actions for the _next_ state – Peter de Rivaz Dec 04 '14 at 14:34
  • I call `t+1` the current state, because this is the one that was reached upon receiving the reward. – Don Reba Dec 04 '14 at 15:03