
I've implemented the natural actor-critic RL algorithm on a simple grid world with four possible actions (up, down, left, right), and I've noticed that in some cases it tends to get stuck oscillating between up and down or between left and right.

Now, in this domain up/down and left/right are opposites, and I feel that learning might be improved if I could somehow make the agent aware of this fact. I was thinking of simply adding a step after the action activations are calculated (e.g. subtracting the left activation from the right activation and vice versa). However, I'm afraid this could cause convergence issues in the general case.
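
To make the idea concrete, here is a rough sketch of the adjustment step I have in mind (Python, with made-up activation values; the `cancel_opposites` helper is purely illustrative):

```python
import numpy as np

# Hypothetical actor output for one state, ordered [up, down, left, right].
activations = np.array([0.7, 0.1, 0.9, 0.8])

def cancel_opposites(a):
    """Subtract each action's opposite activation from its own."""
    adjusted = a.copy()
    adjusted[0] = a[0] - a[1]  # up    minus down
    adjusted[1] = a[1] - a[0]  # down  minus up
    adjusted[2] = a[2] - a[3]  # left  minus right
    adjusted[3] = a[3] - a[2]  # right minus left
    return adjusted

adjusted = cancel_opposites(activations)
greedy_action = int(np.argmax(adjusted))  # 0 ("up") instead of 2 ("left")
```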

It seems as though adding constraints like this would be a common desire in the field, so I was wondering if anyone knows of a standard method I should be using for this purpose. And if not, whether my ad-hoc approach seems reasonable.

Thanks in advance!

zergylord
  • Can you elaborate on `subtracting the left activation from the right activation and vice versa`? – greeness Jan 31 '13 at 21:10
  • @greeness Sure thing :) Imagine the model outputs these activations for each action: up=.7, left=.9, down=.1, right=.8. Normally the most active action would be chosen (i.e. left). But I'd want the model to think that wanting to go left AND right means neither is a good option. I proposed to inform the model of this by altering the activation values to the following: up=.7, left=.9-.8=.1, down=.1, right=.8-.9=-.1. Now the model would select the up action since the left and right activations cancel out. – zergylord Jan 31 '13 at 23:54
  • The question remains as to 1) whether or not this scheme would actually work, and 2) whether there is some standard way of accomplishing the same thing. – zergylord Jan 31 '13 at 23:57
  • I don't think this cancellation is valid. Think of one situation: up=0.2, down=0.1, left=0.8, right=0.8, where going up or down leads into a dead-end "black hole" path, while going left or right avoids it. If you use the cancellation, you will never find the right way, right? Did you try an epsilon-exploration method so that every action gets a chance to be tried? – greeness Feb 01 '13 at 01:02
  • Hmm yeah, I figured there was something obviously wrong with my idea; that would be it, lol. And yeah, I've been doing e-greedy with a fairly high epsilon (anywhere from 0 to 25%), but I don't think that exploration is the problem. I think the problem is introduced by my having very fine-grained actions. Going back to your black hole example: after turning left or right the agent hasn't completely cleared the hole, and so must make pretty much the same decision again. So the agent ends up making a random walk along the left-right axis until he clears the hole. Can't figure out how to fix that :-/ – zergylord Feb 01 '13 at 02:24
  • Try changing the learning rate to `alpha = 1/(number of times you visited (state, action) so far)` (see the sketch just below these comments). I have a post here: http://stackoverflow.com/questions/13148934/unbounded-increase-in-q-value-consequence-of-recurrent-reward-after-repeating-t/13153633#13153633. Maybe it helps. – greeness Feb 01 '13 at 02:47
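
A minimal sketch of the visit-count learning rate greeness suggests, applied here to a plain tabular Q-learning update for illustration (the actual natural actor-critic update differs; this is only meant to show the `alpha = 1/N(s, a)` schedule):

```python
from collections import defaultdict

GAMMA = 0.95
ACTIONS = (0, 1, 2, 3)            # up, down, left, right
q = defaultdict(float)            # Q-values keyed by (state, action)
visits = defaultdict(int)         # visit counts keyed by (state, action)

def q_update(state, action, reward, next_state):
    """One tabular Q-learning step with alpha = 1 / N(s, a)."""
    visits[(state, action)] += 1
    alpha = 1.0 / visits[(state, action)]   # decays as the pair is revisited
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + GAMMA * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
```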

1 Answer


I'd stay away from using heuristics in the selection of actions, if at all possible. If you want to add heuristics to your training, I'd do it in the calculation of the reward function. That way the agent will learn and embody the heuristic as a part of the value function it is approximating.
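
For example (just one possible way to encode such a heuristic, with an arbitrarily chosen penalty), you could subtract a small shaping term whenever the agent immediately reverses its previous move:

```python
# Opposite-action map and penalty size are arbitrary illustrative choices.
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}
REVERSAL_PENALTY = 0.1

def shaped_reward(env_reward, prev_action, action):
    """Environment reward minus a small penalty for undoing the last move."""
    if prev_action is not None and action == OPPOSITE[prev_action]:
        return env_reward - REVERSAL_PENALTY
    return env_reward
```

Keep in mind that a shaping term like this changes the objective the agent optimizes, so a large penalty can change which policy ends up optimal.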

About the oscillation behavior, do you allow for the action of no movement (i.e. staying in the same location)?

Finally, I wouldn't worry too much about violating the general-case convergence guarantees. They are merely guidelines when doing applied work.

danelliottster