State dependent action set in reinforcement learning

Question

How do people deal with problems where the legal actions differ from state to state? In my case I have about 10 actions in total, and the legal action sets do not overlap: in certain types of states the same 3 actions are always legal, and those actions are never legal in other types of states.

I'm also interested in seeing whether the solutions would be different if the legal action sets overlapped.

For Q-learning (where my network gives me the values for state/action pairs), I was thinking that I could just be careful about which Q value I choose when constructing the target value (i.e. instead of taking the max over all actions, I take the max over the legal actions).
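A minimal sketch of that masked target, assuming the network's next-state Q values are available as a plain list and the environment can supply a boolean legality mask. The function name `masked_q_target` and its parameters are made up for illustration and do not come from any library:

```python
def masked_q_target(reward, next_q_values, next_legal_mask, gamma=0.99, done=False):
    """One-step Q-learning target that only considers legal next actions."""
    if done:
        # Terminal transition: nothing to bootstrap from.
        return reward
    legal_values = [q for q, legal in zip(next_q_values, next_legal_mask) if legal]
    if not legal_values:
        # Assumption of this sketch: a state with no legal action is treated as terminal.
        return reward
    return reward + gamma * max(legal_values)


# Five actions, of which only actions 0-2 are legal in the next state.
next_q = [0.2, 1.0, -0.5, 3.0, 2.5]
mask = [True, True, True, False, False]
target = masked_q_target(reward=1.0, next_q_values=next_q, next_legal_mask=mask, gamma=0.9)
```

Here the unmasked max would bootstrap from action 3 (Q = 3.0), which can never be taken in that state; the masked version bootstraps from action 1 instead.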

For policy-gradient methods I'm less sure what the appropriate setup is. Is it okay to just mask the output layer when computing the loss?

score 4 · Answer 1 · answered Aug 12 '20 at 03:06

There are two closely related works from the last two years:

[1] Boutilier, Craig, et al. "Planning and learning with stochastic action sets." arXiv preprint arXiv:1805.02363 (2018).

[2] Chandak, Yash, et al. "Reinforcement Learning When All Actions Are Not Always Available." AAAI. 2020.

score 3 · Answer 2 · answered May 10 '18 at 20:08

Currently this problem does not seem to have a single, universal, straightforward answer. Maybe because it is not that much of an issue?

Your suggestion of choosing the best Q value among the legal actions is actually one of the proposed ways to handle this. For policy-gradient methods you can achieve a similar result by masking the illegal actions and rescaling the probabilities of the remaining actions so that they sum to 1.
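As a rough illustration of the policy-gradient variant, the sketch below applies a softmax over the legal logits only, which is equivalent to zeroing out the illegal probabilities and renormalising. The helper name `masked_softmax` is an assumption of this example, and a real implementation would do the same thing on tensors inside the network's framework:

```python
import math


def masked_softmax(logits, legal_mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    legal_logits = [l for l, ok in zip(logits, legal_mask) if ok]
    if not legal_logits:
        raise ValueError("at least one action must be legal")
    shift = max(legal_logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) if ok else 0.0 for l, ok in zip(logits, legal_mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Four actions; action 3 has the largest logit but is illegal in this state.
logits = [2.0, 1.0, 0.5, 4.0]
mask = [True, True, True, False]
probs = masked_softmax(logits, mask)

# The log-probability of the sampled (legal) action is what enters the loss.
log_prob_taken = math.log(probs[1])
```

Sampling from `probs` can never pick an illegal action, and the loss is computed from the same renormalised distribution that was used for sampling.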

Another approach would be to give a negative reward for choosing an illegal action, or to ignore the choice and make no change in the environment, returning the same reward as before. In one of my own projects (a Q-learning method) I chose the latter, and the agent learned what it had to learn, but from time to time it used the illegal actions as a 'no action' action. It wasn't really a problem for me, but negative rewards would probably eliminate this behaviour.
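Both variants can be sketched as a thin wrapper around the environment. Everything here is hypothetical: the `legal_actions()` / `step()` interface, the `IllegalActionWrapper` class, and the toy `LineWorld` environment are invented for this example rather than taken from an existing library.

```python
class LineWorld:
    """Toy environment for this sketch: walk along cells 0..3, goal at cell 3."""

    def __init__(self):
        self.state = 0

    def legal_actions(self):
        actions = []
        if self.state > 0:
            actions.append(0)  # move left
        if self.state < 3:
            actions.append(1)  # move right
        return actions

    def step(self, action):
        self.state += 1 if action == 1 else -1
        done = self.state == 3
        return self.state, (1.0 if done else 0.0), done


class IllegalActionWrapper:
    """Handles illegal actions either as a no-op or with a fixed penalty."""

    def __init__(self, env, mode="noop", penalty=-1.0):
        if mode not in ("noop", "penalty"):
            raise ValueError("mode must be 'noop' or 'penalty'")
        self.env = env
        self.mode = mode
        self.penalty = penalty
        self.last_reward = 0.0

    def step(self, action):
        if action in self.env.legal_actions():
            state, reward, done = self.env.step(action)
            self.last_reward = reward
            return state, reward, done
        # Illegal action: the environment state is left untouched.
        if self.mode == "penalty":
            return self.env.state, self.penalty, False
        return self.env.state, self.last_reward, False  # same reward as before


noop_env = IllegalActionWrapper(LineWorld(), mode="noop")
noop_result = noop_env.step(0)        # moving left from cell 0 is illegal

penalty_env = IllegalActionWrapper(LineWorld(), mode="penalty", penalty=-1.0)
penalty_result = penalty_env.step(0)  # same illegal move, now penalised
```

In `"noop"` mode an illegal action costs the agent nothing but a time step, which is how it can end up being used as a 'no action' action; `"penalty"` mode makes that choice explicitly unattractive.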

As you can see, these solutions don't change when the legal action sets overlap.

Answering what you've asked in the comments: I don't believe you can train the agent under the described conditions without it learning the legal/illegal action rules. Avoiding that would need, for example, something like a separate network for each set of legal actions, which doesn't sound like the best idea (especially if there are lots of possible legal action sets).

But is learning these rules hard?

You have to answer some questions yourself: is the condition that makes an action illegal hard to express? It is, of course, environment-specific, but I would say that most of the time it is not that hard to express, and agents just learn these conditions during training. If it is hard, does your environment provide enough information about the state?

score 1 · Answer 3 · answered Apr 25 '18 at 09:29

I'm not sure I understand your question correctly, but if you mean that some actions are impossible in certain states, then you can simply reflect that in the reward function (a large negative value). You can even decide to end the episode if it is not clear what state the illegal action would result in. The agent should then learn that those actions are undesirable in those specific states.

In exploration mode, the agent might still choose to take the illegal actions. However, in exploitation mode it should avoid them.

Jan K


See my comment to the other answer. You are suggesting that the agent has to learn the rules of the game as well, which makes the problem harder. Are there ways to somehow provide that information to the agent without it having to learn it? – Edmonds Karp Apr 26 '18 at 01:45

score 0 · Answer 4 · Bert Kellerman · 2018-04-25T22:22:05.503

I recently built a DDQ agent for Connect Four and had to address this. Whenever the agent chose a column that was already full of tokens, I set the reward equal to the reward for losing the game. This was -100 in my case, and it worked well.

In Connect Four, being allowed an illegal move (effectively skipping a turn) can in some cases be advantageous for the player. This is why I set the reward equal to the reward for losing rather than to a smaller negative number.
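A small sketch of how that check might look, assuming a 6x7 board stored as a list of rows with row 0 at the top and 0 meaning an empty cell. The helper names are hypothetical, only the -100 value comes from the setup described above, and win/draw detection is left out:

```python
ROWS, COLS = 6, 7
LOSS_REWARD = -100.0  # same value as the reward for losing the game


def new_board():
    return [[0] * COLS for _ in range(ROWS)]


def column_is_full(board, col):
    # Row 0 is the top row, so a column is full once its top cell is occupied.
    return board[0][col] != 0


def apply_move(board, col, player):
    """Return (reward, was_legal); the board is only changed by legal moves."""
    if column_is_full(board, col):
        return LOSS_REWARD, False
    for row in range(ROWS - 1, -1, -1):  # drop the token to the lowest empty cell
        if board[row][col] == 0:
            board[row][col] = player
            break
    return 0.0, True  # win/draw rewards omitted in this sketch


board = new_board()
for turn in range(ROWS):
    apply_move(board, 3, player=1 + turn % 2)  # fill column 3 completely

reward, was_legal = apply_move(board, 3, player=1)  # seventh token in a full column
```

Whether the episode should also end at that point, or the agent should be asked to move again, is a separate design decision that this sketch leaves to the caller.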

So if you set the penalty for an illegal move to something milder than the penalty for losing, you'll have to consider what the implications of allowing illegal moves during exploration are in your domain.

edited Apr 25 '18 at 22:22

answered Apr 25 '18 at 22:16


But that seems to suggest that the agent has to learn the rules of the game as well, which makes the problem harder. Are there ways to somehow provide that information to the agent without it having to learn it? – Edmonds Karp Apr 26 '18 at 01:44
