
I have successfully used Q-learning to solve some classic reinforcement learning environments from OpenAI Gym (e.g. Taxi, CartPole). These environments allow a single action to be taken at each time step. However, I cannot find a way to solve problems where multiple actions are taken simultaneously at each time step. For example, in the Roboschool Reacher environment, 2 torque values - one for each axis - must be specified at each time step. The problem is that the Q matrix is built from (state, action) pairs, so if more than one action is taken simultaneously, it is not straightforward to build the Q matrix.

The book "Deep Reinforcement Learning Hands-On" by Maxim Lapan mentions this but does not give a clear answer, see quotation below.

Of course, we're not limited to a single action to perform, and the environment could have multiple actions, such as pushing multiple buttons simultaneously or steering the wheel and pressing two pedals (brake and accelerator). To support such cases, Gym defines a special container class that allows the nesting of several action spaces into one unified action.

Does anybody know how to deal with multiple actions in Q-learning?

PS: I'm not talking about the issue "continuous vs discrete action space", which can be tackled with DDPG.

Pierre

1 Answer


You can take one of two approaches, depending on the problem:

  1. Think of the set of actions you need to pass to the environment as independent, and make the network output action values for each one (with a separate softmax per action) - so if you need to pass two actions, the network will have two heads, one for each axis.

  2. Think of them as dependent and look at the Cartesian product of the action sets, then make the network output a value for each combination - so if you have two actions to pass and 5 options for each, the output layer will have 5*5=25 entries, and you just use softmax over that. A sketch of both layouts is given after this list.
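
As a rough illustration for a DQN-style agent, the two layouts might look like the sketch below. The layer sizes, `OBS_SIZE`, the 5-level discretisation, and the class/function names are all illustrative assumptions, not part of the Reacher environment or the book:

```python
import torch
import torch.nn as nn

OBS_SIZE = 8              # assumed observation size, for illustration only
N_OPTIONS_PER_AXIS = 5    # assumed number of discrete torque levels per axis

class TwoHeadQNet(nn.Module):
    """Approach 1: treat the action dimensions as independent.
    One head per axis, each producing values for that axis only."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(OBS_SIZE, 64), nn.ReLU())
        self.head_axis0 = nn.Linear(64, N_OPTIONS_PER_AXIS)
        self.head_axis1 = nn.Linear(64, N_OPTIONS_PER_AXIS)

    def forward(self, obs):
        features = self.body(obs)
        return self.head_axis0(features), self.head_axis1(features)

class JointQNet(nn.Module):
    """Approach 2: treat the dimensions as dependent and enumerate the
    Cartesian product, i.e. 5 * 5 = 25 joint actions in a single head."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_SIZE, 64),
            nn.ReLU(),
            nn.Linear(64, N_OPTIONS_PER_AXIS * N_OPTIONS_PER_AXIS),
        )

    def forward(self, obs):
        return self.net(obs)

def decode_joint_action(q_values):
    """Greedy selection for the joint network: the argmax over the 25
    outputs is decoded back into an (axis0, axis1) index pair."""
    idx = int(torch.argmax(q_values))
    return divmod(idx, N_OPTIONS_PER_AXIS)
```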

  • Thanks for your answer. At the moment, I'm only using Q-learning (by updating the Q values in a Q matrix) but I'm not using DQN or any network yet, so I cannot apply these solutions. I'll give it a try using DQN, but it sounds like it should work. – Pierre Apr 08 '19 at 14:39
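
For the tabular setting mentioned in the comment above, the Cartesian-product idea also applies directly: the action stored in the Q-table simply becomes a tuple covering every axis. Below is a minimal sketch under the assumption that the two torques are discretised into a few levels and that states are hashable (the level values, hyperparameters, and function names are illustrative, not from the thread):

```python
import random
from collections import defaultdict

# Assumed discretisation of each torque axis into three levels.
ACTIONS_AXIS0 = [-1.0, 0.0, 1.0]
ACTIONS_AXIS1 = [-1.0, 0.0, 1.0]
# The joint action set is the Cartesian product of the two axes.
JOINT_ACTIONS = [(a0, a1) for a0 in ACTIONS_AXIS0 for a1 in ACTIONS_AXIS1]

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
# Q is indexed by (state, joint_action); states must be hashable,
# e.g. discretised observations turned into tuples.
Q = defaultdict(float)

def choose_action(state):
    """Epsilon-greedy selection over the joint action set."""
    if random.random() < EPSILON:
        return random.choice(JOINT_ACTIONS)
    return max(JOINT_ACTIONS, key=lambda a: Q[(state, a)])

def update(state, joint_action, reward, next_state):
    """Standard one-step Q-learning update; only the action type changed."""
    best_next = max(Q[(next_state, a)] for a in JOINT_ACTIONS)
    td_target = reward + GAMMA * best_next
    Q[(state, joint_action)] += ALPHA * (td_target - Q[(state, joint_action)])
```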