What is the difference between policy gradient methods and neural network-based action-value methods?
1 Answer
We need to differentiate between "action selection" and "action-value estimation".
Action-value estimation, denoted Q(s, a), consists of calculating a "score" (often called the expected future reward) for a particular action a in a given state s. At this point we have only estimated the value Q(s, a); we still don't know which action we will take.
Then there is action selection: a function f which, based on some information, returns the action we perform.
A broad class called action-value methods consists of "action selection" methods which, given the action-value estimates (scores) Q, return an action to perform. An example is the epsilon-greedy method: with probability 1 - epsilon it picks the action with the highest action-value score, and with probability epsilon (usually a small number) it picks an action at random. The only information it uses is the Q scores.
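To make this concrete, here is a minimal sketch of epsilon-greedy action selection (NumPy, the function name, and the made-up Q numbers are my own illustration, not from the answer):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action from a vector of action-value estimates Q(s, a).

    With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest estimated value.
    """
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best-scoring action

# Example: made-up Q estimates for 4 actions in some state s
q_s = np.array([0.2, 1.5, -0.3, 0.9])
action = epsilon_greedy(q_s, epsilon=0.05)
```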
Policy gradient methods perform action selection. The information we give to f is the current state s and some parameters theta: f(s, theta). We can imagine these parameters theta to be the weights of a neural network. In practice, we would set the network's weights to theta, give it the state s as input, and get an action a as output. This is just one example of what a policy gradient method may look like. We don't need any state-value or action-value estimates to obtain the policy. Furthermore, the policy must be differentiable with respect to its parameters theta, which is what allows us to compute the policy gradient.
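As an illustration of what "a network parameterized by theta that maps a state to an action" could look like, here is a hedged sketch using PyTorch (the framework, layer sizes, and names are my own assumptions, not from the answer):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A parameterized policy pi(a | s, theta); the network weights play the role of theta."""

    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        # A differentiable mapping from state to a probability distribution over actions
        return torch.softmax(self.net(state), dim=-1)

policy = PolicyNetwork(state_dim=4, num_actions=2)
state = torch.randn(4)                                    # some observed state s
probs = policy(state)                                     # pi(. | s, theta)
action = torch.multinomial(probs, num_samples=1).item()   # sample the action to take
```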
Actor-Critic methods also perform action selection. The difference from plain policy gradient methods is that the action-value estimates Q, learned by a "critic", are used alongside the parameterized policy (the "actor"): conceptually, f(s, theta, Q). Here we do need value estimates in addition to the policy, since the critic's estimates guide how the policy parameters theta are updated.
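To show where the value estimates enter, here is a hedged sketch of a one-step actor-critic update (PyTorch again; I use a state-value critic and a TD error as the learning signal, which is one common variant, and all names and sizes are illustrative, not from the answer):

```python
import torch
import torch.nn as nn

# Actor: the parameterized policy pi(a | s, theta)
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
# Critic: a value estimator whose output judges the actor's choices
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward, next_state, done, gamma=0.99):
    """One actor-critic update for a single observed transition (sketch only)."""
    probs = torch.softmax(actor(state), dim=-1)
    log_prob = torch.log(probs[action])

    value = critic(state).squeeze()
    next_value = torch.tensor(0.0) if done else critic(next_state).squeeze().detach()
    td_error = reward + gamma * next_value - value   # the critic's evaluation signal

    critic_loss = td_error.pow(2)                    # critic learns better value estimates
    actor_loss = -log_prob * td_error.detach()       # actor is pushed toward well-rated actions

    actor_opt.zero_grad()
    critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step()
    critic_opt.step()
```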
You can read more about the differences in "Reinforcement Learning: An Introduction" by Sutton and Barto in Chapter 13: Policy Gradient Methods.
Thanks, but I don't get it. In DeepMind's paper about playing Atari with deep reinforcement learning, we give just screen images as input and the network returns Q-values for 8 actions, then we choose the biggest one. In this form we are not giving available actions to the network, we just give the current state! I'm not talking about Q-value estimation. In a very large and continuous problem space we can't give action-value pairs to the network. DeepMind gives just the state and gets the proper action to take. In policy gradient methods we do the same thing, so what is the difference between these two? – Fcoder May 05 '18 at 14:53
Okay, I apologize for the long delay. I also got confused. The method they implement in the paper is NOT a policy gradient method. They use the neural network as a parametric approximation of the action values (i.e. an action-value estimator) and then perform action selection purely on these action-value approximations (epsilon-greedy). The neural network is parametrized by some `w`, receives a state `s` (which in their case is a history of 4 frames) and **produces action-value estimates** q(s, a, w) for all actions `a` for that particular state `s`. – Aechlys May 05 '18 at 16:32
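To illustrate the distinction made in that last comment, here is a hedged sketch of that kind of action-value network: it is parameterized by weights w, takes only the state as input, outputs one Q estimate per action, and epsilon-greedy action selection is a separate step on top (PyTorch, a flat state vector instead of image frames, and all sizes are my own simplifications, not the paper's architecture):

```python
import random
import torch
import torch.nn as nn

# Q-network: parameterized by w, maps a state s to Q(s, a, w) for every action a
q_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 8))

def select_action(state, epsilon=0.05):
    """Epsilon-greedy action selection on top of the network's Q estimates."""
    if random.random() < epsilon:
        return random.randrange(8)          # explore: random action
    with torch.no_grad():
        q_values = q_network(state)         # Q(s, a, w) for all 8 actions
    return int(torch.argmax(q_values))      # exploit: pick the highest estimate

state = torch.randn(4)        # the state s (in the paper, a stack of recent frames)
action = select_action(state)
```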