
Is there a value-based (deep) reinforcement learning (RL) algorithm available that is centred entirely on learning only the state-value function V(s), rather than the state-action value function Q(s,a)?

If not, why not? Or could it easily be implemented?

Are there even any implementations available in Python, say in PyTorch or TensorFlow, or at a higher level in RLlib or similar?

I ask because

  • I have a multi-agent problem to simulate in which, in reality, the agents' actions are defined by an efficient centralized decision-making mechanism that (i) successfully incentivizes truth-telling by the decentralized agents, and (ii) essentially depends on the actors' value functions, i.e. on `V_i(s_{i,t+1})` for the different achievable post-period states `s_{i,t+1}` of every actor i. From an individual agent's point of view, the multi-agent nature combined with gradual learning makes the system look non-stationary as long as training is not finished. Because of the nature of the problem, I'm rather convinced that learning any natural Q(s,a) function is significantly less efficient than simply learning the terminal value function V(s), from which the centralized mechanism can readily derive the eventual actions for all agents by solving a separate sub-problem based on all agents' values.

  • The math of the typical DQN with temporal-difference learning seems naturally adaptable to a state-only, value-based training of a deep network for V(s) instead of the combined Q(s,a); a minimal sketch of what I mean follows this list. Yet, within the value-based RL subdomain, everybody seems to focus on learning Q(s,a), and I have not found any purely V(s)-learning algorithms so far (other than analytical, non-deep, traditional Bellman-equation dynamic-programming methods).
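
A rough sketch of what I have in mind, in PyTorch (everything here, including the hypothetical `env_model`, the dimensions and the hyperparameters, is only an illustrative placeholder, not an existing library API): TD(0) updates of a state-only value network with a target network, mirroring the DQN recipe, plus greedy action selection via a one-step lookahead, which requires some model of the environment.

```python
import torch
import torch.nn as nn

gamma, state_dim = 0.99, 8  # illustrative values

# Online and target value networks, as in DQN but with a state-only output V(s)
v_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
v_target = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
v_target.load_state_dict(v_net.state_dict())  # sync periodically during training
optimizer = torch.optim.Adam(v_net.parameters(), lr=1e-3)

def td0_update(states, rewards, next_states, dones):
    """One gradient step on the squared TD(0) error for V(s)."""
    with torch.no_grad():
        targets = rewards + gamma * (1 - dones) * v_target(next_states).squeeze(-1)
    loss = nn.functional.mse_loss(v_net(states).squeeze(-1), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def greedy_action(state, actions, env_model):
    """Pick argmax_a [r + gamma * V(s')] via one-step lookahead.
    env_model(state, a) -> (reward, next_state) is an assumed environment model."""
    best_a, best_val = None, -float("inf")
    for a in actions:
        r, s_next = env_model(state, a)
        val = r + gamma * v_net(torch.as_tensor(s_next, dtype=torch.float32)).item()
        if val > best_val:
            best_a, best_val = a, val
    return best_a
```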

I am aware of the Dueling DQN architecture, but it does not seem to be exactly what I am searching for. It at least has a separate stream for V(s), but overall it still aims to learn Q(s,a) in a decentralized way, which seems not conducive in my case.
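
For context (this is just the standard formulation, not specific to my problem): the dueling architecture recombines its value and advantage streams back into Q-values, so the learned target is indeed still Q(s,a):

$$Q(s,a) \;=\; V(s) + A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')$$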

FlorianH
  • Just out of curiosity, when you say the "terminal value function V(s)", what does "terminal" mean in this context? (I'm not familiar with the multi-agent setting.) Thanks! – Pablo EM Mar 24 '20 at 16:47
  • @Pablo EM: Actually, maybe I should remove 'terminal' here; I'm calling it the terminal value because, for finding the period-t actions, we tend to use (mainly or maybe exclusively) the post-t `V(s_{t+1})` rather than `V(s_t)` (or maybe because I'm coming from economics, where we always solve Bellman equations and call V the terminal value function). The term was not mainly used because of the multi-agent setting. – FlorianH Mar 24 '20 at 19:54
  • Ok, thanks for the info Florian. Once you have the value function V(s), how can you derive the actions? As you said in the question, I think it should be possible (and easy?) to adapt DQN to learn V(s) instead of Q(s,a). However, usually having V(s) is not enough to choose actions (unless you have a model of your environment), so the practical application of learning V(s) is much more limited than Q(s,a). Perhaps that is the only reason why most of the literature focuses on Q(s,a). – Pablo EM Mar 25 '20 at 09:46
  • Thanks Pablo, what you write makes sense; I may simply end up writing some custom algorithm myself; probably a good exercise for a newbie like me. From what I've seen, this may be slightly more natural in e.g. PyTorch than in RLlib, which so far feels a bit more high-level and mostly intended for applying existing RL algorithms rather than building your own, even if I think it may be readily possible in RLlib too. – FlorianH Mar 25 '20 at 19:18
  • And to your question Pablo, how I derive the actions: I cannot explain all the details, but maybe this gives the intuition: essentially, there is an efficient algorithm that derives, centrally and without learning, _exactly_ the actions the actors would choose, and the corresponding inter-dependent rewards, as a function of all actors' individual terminal values `V(s_{t+1})` for all reachable states `s_{t+1}`; at least in the theoretical model I try to build. Hence, it seems to me efficient in this case to learn really exactly the value function, rather than Q(s,a). – FlorianH Mar 25 '20 at 19:36
  • I don't know RLlib, but as you said, PyTorch is probably a good framework to experiment with and implement your algorithms. Many thanks for the explanation; I didn't get it completely, but enough for my purposes. Good luck with your project, FlorianH! – Pablo EM Mar 26 '20 at 08:03
  • One solution I have now found, which may bring me close to the aim without having to implement everything from scratch: two-headed networks, one head for policy choice and one for value estimation. Ray RLlib apparently allows this with the parameter `vf_share_layers`, and PyTorch may readily allow it too; cf. https://www.datahubbs.com/two-headed-a2c-network-in-pytorch/ and https://towardsdatascience.com/ray-and-rllib-for-fast-and-parallel-reinforcement-learning-6d31ee21c96c. – FlorianH Apr 15 '20 at 15:29
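
For reference, a minimal PyTorch sketch of the shared-trunk, "two-headed" architecture that the last comment refers to (layer sizes and names are illustrative, not taken from RLlib or the linked posts):

```python
import torch
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """One shared trunk feeding a policy head and a state-value head V(s)."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, n_actions)  # action logits
        self.value_head = nn.Linear(hidden, 1)           # V(s)

    def forward(self, s):
        h = self.trunk(s)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

net = TwoHeadedNet(state_dim=8, n_actions=4)
logits, value = net(torch.randn(1, 8))  # policy logits and V(s) from one forward pass
```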

0 Answers