
I'm using the TF-Agents library for reinforcement learning, and I would like to take into account that, for a given state, some actions are invalid.

How can this be implemented?

Should I define an "observation_and_action_constraint_splitter" function when creating the DqnAgent?

If so, do you know of any tutorial on this?


1 Answer


Yes, you need to define the function, pass it to the agent, and also appropriately change the environment output so that the function can work with it. I am not aware of any tutorial on this, but you can look at this repo I have been working on.
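Roughly, the wiring looks like this (a minimal sketch, not taken from my repo; the dict keys 'observations' and 'legal_moves' and the env variable are illustrative assumptions, not part of the TF-Agents API):

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.networks import q_network

def observation_and_action_constraint_splitter(observation):
    # The environment observation is assumed to be a dict holding the actual
    # network input under 'observations' and a 0/1 vector under 'legal_moves'
    # (1 = action allowed). The splitter returns (network_input, mask).
    return observation['observations'], observation['legal_moves']

# `env` is assumed to be a TFPyEnvironment whose observation_spec() matches
# the dict layout above.
q_net = q_network.QNetwork(
    input_tensor_spec=env.observation_spec()['observations'],
    action_spec=env.action_spec(),
    fc_layer_params=(64, 64))

agent = dqn_agent.DqnAgent(
    time_step_spec=env.time_step_spec(),
    action_spec=env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    observation_and_action_constraint_splitter=observation_and_action_constraint_splitter)
agent.initialize()
```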

Note that the repo is very messy: a lot of the files in there are not actually being used, and the docstrings are terrible and often wrong (I forked it and didn't bother to sort everything out). However, it is definitely working correctly. The parts that are relevant to your question are:

  • rl_env.py in HanabiEnv.__init__, where the _observation_spec is defined as a dictionary of ArraySpecs (here). You can ignore game_obs, hand_obs and knowledge_obs, which are only used to run the environment verbosely; they are not fed to the agent. A stripped-down sketch of this pattern is given after this list.

  • rl_env.py in HanabiEnv._reset at line 110 gives an idea of how the timestep observations are constructed and returned from the environment. legal_moves are passed through np.logical_not because my specific environment marks legal moves with 0 and illegal ones with -inf, whereas TF-Agents expects a 1/True for a legal move. My vector, when cast to bool, would therefore be the exact opposite of what TF-Agents needs.

  • These observations are then fed to the observation_and_action_constraint_splitter in utility.py (here), where a tuple containing the observations and the action constraints is returned. Note that game_obs, hand_obs and knowledge_obs are implicitly thrown away (and not fed to the agent, as previously mentioned).

  • Finally, this observation_and_action_constraint_splitter is fed to the agent in utility.py in the create_agent function, at line 198 for example.
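For reference, here is a stripped-down, hypothetical environment (the class name, field names and dynamics are made up for illustration and are not my Hanabi code) that follows the same pattern and returns the dict observations that a splitter like the one above expects:

```python
import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

# Toy environment illustrating a dict observation_spec with a legal-moves mask.
class MaskedEnv(py_environment.PyEnvironment):

    def __init__(self, num_actions=5, obs_size=8):
        super().__init__()
        self._num_actions = num_actions
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=num_actions - 1,
            name='action')
        self._observation_spec = {
            'observations': array_spec.ArraySpec(
                shape=(obs_size,), dtype=np.float32, name='observations'),
            'legal_moves': array_spec.ArraySpec(
                shape=(num_actions,), dtype=np.int32, name='legal_moves'),
        }
        self._state = np.zeros(obs_size, dtype=np.float32)

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _legal_moves(self):
        # The underlying Hanabi game marks legal moves with 0 and illegal ones
        # with -inf (hence the np.logical_not in rl_env.py); here we mimic that
        # and convert it to the 1-for-legal / 0-for-illegal vector TF-Agents expects.
        raw = np.full(self._num_actions, -np.inf, dtype=np.float32)
        raw[::2] = 0.0  # arbitrarily mark every other action as legal
        return np.logical_not(raw).astype(np.int32)

    def _observation(self):
        return {'observations': self._state.copy(),
                'legal_moves': self._legal_moves()}

    def _reset(self):
        self._state = np.zeros_like(self._state)
        return ts.restart(self._observation())

    def _step(self, action):
        # Toy dynamics: nothing changes, reward is always zero.
        return ts.transition(self._observation(), reward=0.0)
```

Such an environment can be wrapped with tf_py_environment.TFPyEnvironment and then used with an agent constructed as sketched earlier.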

  • Thanks for your answer, Federico! Just a question: here (https://github.com/tensorflow/agents/issues/255) a 1 is passed for a valid action ("[...] a tensor that has a 1 for each allowed action and 0 for not allowed"). It's just the opposite of what you wrote: who's right? – MarcoM Dec 11 '20 at 18:02
  • @MarcoM Yes, you are absolutely right. It's been a while since I last worked on this repo, and I forgot that the legal_moves before the logical_not are actually a vector of 0 and -inf where 0 marks legal moves... That's why I have to negate them in my case. Sorry about that, I edited my comment for future viewers – Federico Malerba Dec 12 '20 at 08:46
  • This answer applies to the Q-learning agents, but not to other ones in the TF-Agents library such as ppo_agent (accessed 2021-09-09). I wonder if that is an oversight by the library authors, or whether there is something intrinsically different between the algorithms that prevents implementation of the same mechanism. – Setjmp Sep 09 '21 at 15:17