So in Q-learning, you update the Q function by $Q_{\text{new}}(s,a) = Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$.
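(For concreteness, here is a minimal tabular sketch of that update in Python; the hyperparameter values are just placeholders and the state/action representation is up to you.)

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.99          # placeholder learning rate and discount
Q = defaultdict(float)            # Q[(state, action)] -> value estimate

def q_update(s, a, r, s_next, actions):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```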
Now, suppose I use the same principle but replace the Q function with a V function. Instead of selecting an action based on the current value estimates, I actually try every action from the current state (assuming I can reset the simulated environment back to that state), pick the best outcome, and use it to update V for that state. Would this yield a better result?
Of course, training time would probably increase because every update requires trying each action once, but since the best action is guaranteed to be selected at each step (except when exploring), would this give a globally optimal policy in the end?
This is a bit similar to value iteration, except that I don't have, and am not building, a model of the problem.
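A rough sketch of what I mean, assuming the simulator exposes hypothetical `snapshot()`/`restore()` methods for resetting and a Gym-style `step()`:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # placeholder hyperparameters
V = defaultdict(float)                   # V[state] -> value estimate

def v_update(env, s, actions):
    """Try every action from state s via simulator resets, update V(s) with the best result."""
    checkpoint = env.snapshot()                    # assumed reset/snapshot API
    outcomes = []
    for a in actions:
        env.restore(checkpoint)                    # rewind to state s
        s_next, r, done, info = env.step(a)        # Gym-style step, assumed
        outcomes.append((r + gamma * V[s_next], a))
    best_target, best_a = max(outcomes, key=lambda t: t[0])
    V[s] += alpha * (best_target - V[s])           # update toward the best one-step return
    # commit to the best action, except for occasional exploration
    chosen = best_a if random.random() > epsilon else random.choice(actions)
    env.restore(checkpoint)
    s_next, r, done, info = env.step(chosen)
    return s_next
```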