
I'm using a neural network with TensorFlow for reinforcement learning (Q-learning) on various tasks, and I want to know how to restrict the set of possible outputs when the action corresponding to a specific output isn't performable in the environment at a given state.

For example, my network is learning to play a game in which 4 actions can be performed. But there is a particular state in which action 1 can't be performed in the environment, yet my network's Q-values indicate that action 1 is the best thing to do. What should I do in this situation?

(Is just choosing a random valid action the best way to handle this problem?)

Xeyes
  • I'd say you should select the highest-scoring valid action. Alternatively, you can penalize it for choosing an invalid action through the reward, like how you automatically lose in tournament chess if you make an illegal move. Curious to see how this one turns out. – Stefan Dragnev May 16 '19 at 13:53
  • Possible duplicate of [Limit neural network output to subset of trained classes](https://stackoverflow.com/questions/44147764/limit-neural-network-output-to-subset-of-trained-classes) – shunyo May 16 '19 at 15:09
  • Thanks for your answer, Stefan. I think that first I'll try to minimize the error between my predicted Q-values and 0 for all invalid moves in a particular state, because ultimately the Q-value of such a move should be 0 (see the sketch below). – Xeyes May 17 '19 at 10:31
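
A minimal sketch of the idea in that last comment, assuming a TF2-style setup; the function name `invalid_target_loss` and the `valid_mask` input are hypothetical, not from the thread. It regresses the taken action's Q-value toward its TD target while pushing every invalid action's Q-value toward 0:

```python
import tensorflow as tf

def invalid_target_loss(q_values, actions, td_targets, valid_mask):
    """Mean squared error that also drives invalid actions' Q-values to 0.

    q_values:   [batch, num_actions] network outputs Q(s, .)
    actions:    [batch] int32 indices of the actions actually taken
    td_targets: [batch] r + gamma * max_a' Q(s', a')
    valid_mask: [batch, num_actions] bool, True where an action is valid
    """
    num_actions = tf.shape(q_values)[1]
    one_hot = tf.one_hot(actions, num_actions)      # 1 at the taken action
    taken_q = tf.reduce_sum(q_values * one_hot, axis=1)
    taken_err = tf.square(td_targets - taken_q)     # usual TD regression
    invalid = tf.cast(~valid_mask, q_values.dtype)  # 1 where action invalid
    invalid_err = tf.reduce_sum(tf.square(q_values) * invalid, axis=1)
    return tf.reduce_mean(taken_err + invalid_err)  # combined objective
```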

1 Answer


You should just ignore the invalid action(s) and select the action with the highest Q-value among the valid actions. Then, in the training step, either multiply the Q-values by a one-hot encoding of the selected actions or use the gather_nd API to select the right Q-value, compute the loss from it, and run a single gradient update. In other words, the losses of the invalid action(s) and of all other non-selected actions are treated as zero, so their gradients vanish in the update.
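
A minimal sketch of both steps, assuming a TF2/Keras setup; the names (`q_network`, `select_action`, `train_step`, `valid_mask`) and the layer sizes are illustrative assumptions, not from the answer:

```python
import numpy as np
import tensorflow as tf

num_actions = 4

# Toy Q-network: maps a state to one Q-value per action.
q_network = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_actions),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

def select_action(state, valid_mask):
    """Pick the highest-Q *valid* action by masking invalid ones with -inf."""
    q_values = q_network(state[None, :])[0]
    masked_q = tf.where(valid_mask, q_values, -np.inf * tf.ones_like(q_values))
    return int(tf.argmax(masked_q))

def train_step(states, actions, targets):
    """One gradient update on the Q-values of the taken actions only."""
    actions = tf.cast(actions, tf.int32)
    with tf.GradientTape() as tape:
        q_values = q_network(states)                    # [batch, num_actions]
        indices = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        taken_q = tf.gather_nd(q_values, indices)       # Q(s, a) for taken a
        loss = tf.reduce_mean(tf.square(targets - taken_q))
    grads = tape.gradient(loss, q_network.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_network.trainable_variables))
    return loss
```

Used together, `select_action` never returns an invalid action, and because the loss in `train_step` only touches the gathered Q-values, the gradient for every non-selected output is exactly zero.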

In this way, the network gradually learns to increase the Q-value of the right action, since only that action's gradient is updated.

I hope this answers your question.

Afshin Oroojlooy
  • Thanks! Yeah, since I made this topic I've tried manually setting the target Q-values for all invalid actions to 0, and it works pretty well! – Xeyes May 21 '19 at 07:16