So in Q-learning, you update the Q function by $Q_{\text{new}}(s,a) = Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$.
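(For concreteness, here is a minimal tabular sketch of that update in Python; the hyperparameter values are just placeholders and the state/action representation is up to you.)

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # placeholder learning rate and discount
Q = defaultdict(float)            # Q[(state, action)] -> value estimate

def q_update(s, a, r, s_next, actions):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```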
Now, suppose I use the same principle but replace the Q function with a V function. Instead of selecting an action based on the current value estimates, I actually try every action from the current state (assuming I can reset the simulated environment back to that state), pick the best outcome, and use it to update V for that state. Would this yield a better result?
Of course, training time would probably increase because every update requires trying each action once, but since the best action is guaranteed to be selected at each step (except when exploring), would this give a globally optimal policy in the end?
This is a bit similar to value iteration, except that I don't have, and am not building, a model of the problem.
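A rough sketch of what I mean, assuming the simulator exposes hypothetical `snapshot()`/`restore()` methods for resetting and a Gym-style `step()`:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # placeholder hyperparameters
V = defaultdict(float)                   # V[state] -> value estimate

def v_update(env, s, actions):
    """Try every action from state s via simulator resets, update V(s) with the best result."""
    checkpoint = env.snapshot()                    # assumed reset/snapshot API
    outcomes = []
    for a in actions:
        env.restore(checkpoint)                    # rewind to state s
        s_next, r, done, info = env.step(a)        # Gym-style step, assumed
        outcomes.append((r + gamma * V[s_next], a))
    best_target, best_a = max(outcomes, key=lambda t: t[0])
    V[s] += alpha * (best_target - V[s])           # update toward the best one-step return
    # commit to the best action, except for occasional exploration
    chosen = best_a if random.random() > epsilon else random.choice(actions)
    env.restore(checkpoint)
    s_next, r, done, info = env.step(chosen)
    return s_next
```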