Is there a value-based (deep) reinforcement learning (RL) algorithm available that is centred fully around learning only the state-value function V(s), rather than the state-action-value function Q(s,a)?
If not, why not? Or could one easily be implemented?
Are any implementations available in Python, say in PyTorch or TensorFlow, or at a higher level in RLlib?
I ask because I have a multi-agent problem to simulate in which, in reality, the agents' actions are determined by an efficient centralized decision-making mechanism that (i) successfully incentivizes truth-telling by the decentralized agents, and (ii) essentially depends on the value functions of the various actors i, i.e. on V_i(s_{i,t+1}) for the different achievable post-period states s_{i,t+1} of each actor i. From an individual agent's point of view, the multi-agent setting with gradual learning makes the system look non-stationary as long as training is not finished, and because of the nature of the problem, I am fairly convinced that learning any natural Q(s,a) function would be significantly less efficient than simply learning the terminal value function V(s), from which the centralized mechanism can readily derive the eventual actions for all agents by solving a separate sub-problem based on all agents' values.
The math of the typical DQN with temporal-difference learning seems naturally adaptable to training a deep network for the state-only value V(s) instead of the combined Q(s,a). Yet, within the value-based RL subdomain, everyone seems to focus on learning Q(s,a), and I have not found any purely V(s)-learning algorithms so far (other than the analytical, non-deep, traditional dynamic-programming methods based on the Bellman equation).
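For illustration, here is roughly the kind of V(s)-only TD(0) update I have in mind, directly analogous to the DQN target but without the action dimension. This is only a minimal sketch; the network sizes, hyperparameters, and the replay-buffer batch format are placeholders I made up:

```python
# Minimal sketch (not an existing library implementation) of semi-gradient TD(0)
# learning of V(s) with a target network, analogous to the DQN update but with a
# scalar value output instead of one output per action. Batch format is assumed.
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar V(s), no action dimension
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def td0_update(v_net, v_target, optimizer, batch, gamma=0.99):
    """One TD(0) step: regress V(s) towards r + gamma * V_target(s') for non-terminal s'."""
    s, r, s_next, done = batch  # float tensors sampled from a replay buffer (assumed)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * v_target(s_next)
    loss = nn.functional.mse_loss(v_net(s), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In my setting, the separate sub-problem solved by the centralized mechanism would then play the role that the argmax over Q(s,a) plays in DQN when deriving actions.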
I am aware of the Dueling DQN architecture, but it does not seem to be exactly what I am looking for. It does at least contain a separate stream that estimates V(s), but overall it still aims to learn Q(s,a), from which actions are then derived in a decentralized way, which does not seem conducive in my case.