What is the difference between policy gradient methods and neural network-based action-value methods?
1 Answer
We need to differentiate between "action selection" and "action-value estimation".
Action-value estimation, denoted Q(s, a), consists of calculating a "score" (often called the expected future reward) for a particular action a in a given state s. At this point we have only estimated the value Q(s, a); we still don't know which action we will take.
Then there is action selection: a function f which, based on some information, returns the action we perform.
A broad class called action-value methods consists of "action selection" methods which, given the action-value estimates (scores) Q, return an action to perform. An example is the epsilon-greedy method: with probability 1 - epsilon it picks the action with the highest action-value score, and with probability epsilon (usually a small number) it picks an action at random. The only information it uses is the Q scores.
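To make this concrete, here is a minimal sketch of epsilon-greedy action selection (NumPy, the function name, and the made-up Q numbers are my own illustration, not from the answer):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Select an action from a vector of action-value estimates Q(s, a).

    With probability epsilon pick a random action (exploration),
    otherwise pick the action with the highest estimated value.
    """
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: random action
    return int(np.argmax(q_values))              # exploit: best-scoring action

# Example: made-up Q estimates for 4 actions in some state s
q_s = np.array([0.2, 1.5, -0.3, 0.9])
action = epsilon_greedy(q_s, epsilon=0.05)
```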
Policy gradient methods perform action selection. The information we give to f is the current state s and some parameters theta: f(s, theta). We can imagine these parameters theta to be the weights of a neural network. In practice, we would set the network's weights to theta, give it the state s as input, and get an action a as output. This is just one example of what a policy gradient method may look like. We don't need any state-value or action-value estimates to obtain the policy. Furthermore, the policy must be differentiable with respect to its parameters theta, which is what allows us to compute the policy gradient.
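As an illustration of what "a network parameterized by theta that maps a state to an action" could look like, here is a hedged sketch using PyTorch (the framework, layer sizes, and names are my own assumptions, not from the answer):

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """A parameterized policy pi(a | s, theta); the network weights play the role of theta."""

    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, num_actions),
        )

    def forward(self, state):
        # A differentiable mapping from state to a probability distribution over actions
        return torch.softmax(self.net(state), dim=-1)

policy = PolicyNetwork(state_dim=4, num_actions=2)
state = torch.randn(4)                                    # some observed state s
probs = policy(state)                                     # pi(. | s, theta)
action = torch.multinomial(probs, num_samples=1).item()   # sample the action to take
```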
Actor-Critic methods also perform action selection. The difference from plain policy gradient methods is that the action-value estimates Q, learned by a "critic", are used alongside the parameterized policy (the "actor"): conceptually, f(s, theta, Q). Here we do need value estimates in addition to the policy, since the critic's estimates guide how the policy parameters theta are updated.
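To show where the value estimates enter, here is a hedged sketch of a one-step actor-critic update (PyTorch again; I use a state-value critic and a TD error as the learning signal, which is one common variant, and all names and sizes are illustrative, not from the answer):

```python
import torch
import torch.nn as nn

# Actor: the parameterized policy pi(a | s, theta)
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
# Critic: a value estimator whose output judges the actor's choices
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(state, action, reward, next_state, done, gamma=0.99):
    """One actor-critic update for a single observed transition (sketch only)."""
    probs = torch.softmax(actor(state), dim=-1)
    log_prob = torch.log(probs[action])

    value = critic(state).squeeze()
    next_value = torch.tensor(0.0) if done else critic(next_state).squeeze().detach()
    td_error = reward + gamma * next_value - value   # the critic's evaluation signal

    critic_loss = td_error.pow(2)                    # critic learns better value estimates
    actor_loss = -log_prob * td_error.detach()       # actor is pushed toward well-rated actions

    actor_opt.zero_grad()
    critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step()
    critic_opt.step()
```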
You can read more about the differences in "Reinforcement Learning: An Introduction" by Sutton and Barto in Chapter 13: Policy Gradient Methods.
Thanks, but I don't get it. In DeepMind's paper about playing Atari with deep reinforcement learning, we give just screen images as input and the network returns Q-values for 8 actions, then we choose the biggest one. In this form we are not giving available actions to the network, we just give the current state! I'm not talking about Q-value estimation. In a very large and continuous problem space we can't give action-value pairs to the network. DeepMind gives just the state and gets the proper action to take. In policy gradient methods we do the same thing, so what is the difference between these two? – Fcoder May 05 '18 at 14:53
Okay, I apologize for the long delay. I also got confused. The method they implement in the paper is NOT a policy gradient method. They use the neural network as a parametric approximation of the action values (i.e. an action-value estimator) and then perform action selection purely on these action-value approximations (epsilon-greedy). The neural network is parametrized by some `w`, receives a state `s` (which in their case is a history of 4 frames) and **produces action-value estimates** q(s, a, w) for all actions `a` for that particular state `s`. – Aechlys May 05 '18 at 16:32
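To illustrate the distinction made in that last comment, here is a hedged sketch of that kind of action-value network: it is parameterized by weights w, takes only the state as input, outputs one Q estimate per action, and epsilon-greedy action selection is a separate step on top (PyTorch, a flat state vector instead of image frames, and all sizes are my own simplifications, not the paper's architecture):

```python
import random
import torch
import torch.nn as nn

# Q-network: parameterized by w, maps a state s to Q(s, a, w) for every action a
q_network = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 8))

def select_action(state, epsilon=0.05):
    """Epsilon-greedy action selection on top of the network's Q estimates."""
    if random.random() < epsilon:
        return random.randrange(8)          # explore: random action
    with torch.no_grad():
        q_values = q_network(state)         # Q(s, a, w) for all 8 actions
    return int(torch.argmax(q_values))      # exploit: pick the highest estimate

state = torch.randn(4)        # the state s (in the paper, a stack of recent frames)
action = select_action(state)
```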