
I'm still new to ML. I recently learned Q-learning and coded it by hand (not using a library like Keras or TensorFlow), and the problem I'm facing is how to write a good reward function for my agent. I started with the following simple reward function:

When moving from (X, Y) to (X1, Y1): return Distance((X, Y), Target) - Distance((X1, Y1), Target)

In other words, the agent gets a positive reward whenever it moves towards the target, and this worked fine on an empty 2D plane.
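In Python, the idea is roughly the following (a sketch only; my actual code may use a different distance metric, Euclidean distance here is just an example):

```python
import math

def distance(x, y, tx, ty):
    # straight-line distance to the target; Manhattan distance would also work on a grid
    return math.hypot(tx - x, ty - y)

def reward(x, y, x1, y1, tx, ty):
    # positive if the move from (x, y) to (x1, y1) brings the agent closer to the target
    return distance(x, y, tx, ty) - distance(x1, y1, tx, ty)
```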

But when I added obstacles, that function no longer helped: the agent took the shortest direct path to the target and got stuck against obstacles forever. I added a punishment for staying in place, and it got stuck against the wall again, this time oscillating back and forth, because the punishment and reward summed to zero and it had already collected a positive reward, so this was still the favourable path. I then added a punishment for passing through the same square twice, but I feel this has become too convoluted, and that there must be a simpler way to do it.

[Image: Starting position (green is the agent, red is the target)]

[Image: The agent getting stuck on the blocked shortest direct path]

After reading about reward functions a bit more, I realize there are multiple things I've misunderstood or done wrong: my reward could grow to around 2000 in a single move instead of staying in a range like [-1, 1], and I had no clear idea of when to use a negative versus a positive reward.

My memory array of states vs. actions (the Q-table) consists of n states, where n = rows * columns, and 5 actions (up, right, down, left, stay in place).
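Concretely, the table looks roughly like this (a sketch with illustrative names; the 7x7 size matches my example grid):

```python
import numpy as np

ROWS, COLS = 7, 7        # grid size in my example
N_STATES = ROWS * COLS   # one state per cell -> 49
N_ACTIONS = 5            # up, right, down, left, stay in place

# one Q-value per (state, action) pair
Q = np.zeros((N_STATES, N_ACTIONS))

def state_index(row, col):
    # flatten (row, col) into a single state id in [0, N_STATES - 1]
    return row * COLS + col
```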

So, knowing that my agent is supposed to find the shortest available (i.e. not blocked) path to the target, what should my reward function look like, and why? Also, the algorithm I learned from didn't really specify values for epsilon, gamma, and the learning rate, so I set them to 0.2, 0.85, and 0.75 respectively.
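For context, the update I apply each step is the standard one-step Q-learning rule with those values (a sketch, not my exact code):

```python
EPSILON = 0.2         # exploration rate for epsilon-greedy action selection
GAMMA = 0.85          # discount factor
LEARNING_RATE = 0.75  # step size (alpha)

def q_update(Q, state, action, r, next_state):
    # Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    best_next = Q[next_state].max()
    Q[state, action] += LEARNING_RATE * (r + GAMMA * best_next - Q[state, action])
```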

My code is in Python, if you want to send the reward function as code.

PS: I searched for this problem on and off Stack Overflow, and all I found were references and articles that explained what a reward function should do, with no detail on how to make it do that or how to turn my requirement into a reward function.

Here's my code file on Github (No GUI): https://github.com/EjHam98/LearningMachineLearning/blob/master/QLearning.py

  • Try using: a -1 reward for each step (which forces the agent to learn the shortest path), a larger negative reward for hitting obstacles, and a positive reward for reaching the target – Girish Hegde Aug 17 '20 at 06:58
  • I tried returning -1 each step, and also -1*n where n is the number of steps so far, and neither worked. I knew it wouldn't, because I need the shortest **available** path, which isn't necessarily the shortest distance to the target, as shown in the images; punishing a longer route means the agent is discouraged from the correct path because it's longer. Basically, the agent has to work similarly to backtracking – LieutenantDV20 Aug 17 '20 at 08:55
  • "punishing a longer route means the bot is discouraged from the correct path because it's longer" I don't think it is true. -1 reward stops it from repeating same steps and encourages agent to reach target as fast possible. As you are training DQN agent what you are considering as `observation` also matters a lot check that. – Girish Hegde Aug 17 '20 at 09:06
  • Better to take the 2D grid representing the environment as the observation, stack it with a few previous observations to form the agent's state representation, and use a CNN as the function approximator. – Girish Hegde Aug 17 '20 at 09:16
  • I'm starting to think my code/understanding is horribly wrong. I've already studied CNNs, but the Q-learning material I followed had no mention of them. Here's the link to what I followed for my code: https://towardsdatascience.com/simple-reinforcement-learning-q-learning-fcddc4b6fe56#:~:text=Q%2Dlearning%20is%20an%20off,a%20policy%20isn't%20needed. Is it good? Is it enough? – LieutenantDV20 Aug 17 '20 at 09:32
  • That means you are implementing Q-learning, not DQN. You mentioned Deep Q-learning in your question, so I suggested a CNN. If you want to implement it using classical Q-learning, you need a very good state representation for the agent, one that encodes the agent's position along with the surrounding conditions. – Girish Hegde Aug 17 '20 at 09:40
  • Yeah, sorry, it appears I was confused about the two. I meant Q-learning, and I've added a link to my code in the question – LieutenantDV20 Aug 17 '20 at 09:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/219955/discussion-between-lieutenantdv20-and-girish-dattatray-hegde). – LieutenantDV20 Aug 17 '20 at 10:34

1 Answer


In your environment the state-action space is very large. If the state has to encode the maze layout, then with just 10 obstacles the total number of states is already around 49 x 48 x C(47, 10), which is more than 10^13, and that is before counting actions or other possible numbers of obstacles.
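A quick sanity check of that count (assuming the state would have to encode the agent's cell, the target's cell, and the positions of 10 obstacles on a 7x7 grid):

```python
from math import comb

cells = 7 * 7                           # 49 grid cells
agent = cells                           # agent can occupy any cell
target = cells - 1                      # target in any remaining cell
obstacle_layouts = comb(cells - 2, 10)  # choose 10 obstacle cells from the rest

print(agent * target * obstacle_layouts)  # about 1.2e13, i.e. more than 10**13
```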

So it is better to use Deep Q-learning with a powerful CNN function approximator.

  • Observation - the 2D grid representing the maze (or an image of it)

  • Agent's state - a stack of the current observation along with a few previous frames (2-3).

  • Reward structure (a sketch follows this list):

    • -1 for each time step
    • +ve reward for reaching target state
    • -ve reward for hitting obstacles
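A minimal sketch of that reward structure (the exact magnitudes are illustrative; only their signs and relative sizes matter):

```python
STEP_PENALTY = -1        # paid every time step, so shorter paths accumulate less penalty
OBSTACLE_PENALTY = -10   # discourages moves that hit a wall or obstacle
TARGET_REWARD = 100      # large positive reward for reaching the target state

def reward(hit_obstacle, reached_target):
    if reached_target:
        return TARGET_REWARD
    if hit_obstacle:
        return OBSTACLE_PENALTY
    return STEP_PENALTY
```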

Classical Q-learning is better suited to simple environments (like the OpenAI Gym control environments). Here's a sample implementation of Q-learning for a Gym control environment.

Girish Hegde
  • The way I have it set up is: State: the current position as a single integer between 0 and rows*columns-1 (in this case 49 states). Actions: 5 possible actions (up, down, right, left, stay), so my state-action table is just 49x5. But it's probably wrong, since honestly I've started doubting everything I know on this topic. Also, I thought this was a simple environment, it's just 2D with walls and a single objective.. Can you please point me to good learning resources for Q-learning and DQN? I'd prefer ones that explain the algorithm and how it works in detail – LieutenantDV20 Aug 17 '20 at 10:31
  • Storing only the position won't work, because the agent needs to consider its surroundings before taking an action. For example, in the 1st episode there might be an obstacle to the left of position `(3, 3)`, but in the next episode the obstacle could be to the right of `(3, 3)`. The agent's position alone can't encode such a situation. Refer to [this](https://www.youtube.com/playlist?list=PLzuuYNsE1EZAXYR4FJ75jcJseBmo4KQ9-) awesome lecture series by David Silver. – Girish Hegde Aug 17 '20 at 10:40
  • Thanks for the help, appreciated! Time for me to properly learn – LieutenantDV20 Aug 17 '20 at 10:44