
I am working on learning Q-tables and ran through a simple version that only used a 1-dimensional array to move forward and backward. Now I am trying 4-direction movement and got stuck on controlling the agent.

I have the random movement down now and the agent will eventually find the goal, but I want it to learn how to get to the goal instead of randomly stumbling onto it. So I would appreciate any advice on adding Q-learning to this code. Thank you.

Here is my full code; it is stupid simple right now.

import numpy as np
import random
import math

world = np.zeros((5,5))
print(world)
# Make sure the goal can never be at (0, 0), i.e. the start point
goal_x = random.randint(1,4)
goal_y = random.randint(1,4)
goal = (goal_x, goal_y)
print(goal)
world[goal] = 1
print(world)

LEFT = 0
RIGHT = 1
UP = 2
DOWN = 3
map_range_min = 0
map_range_max = 4  # highest valid index on the 5x5 grid (0-4)

class Agent:
    def __init__(self, current_position, my_goal, world):
        self.current_position = current_position
        self.last_position = current_position
        self.visited_positions = []
        self.goal = my_goal
        self.last_reward = 0
        self.totalReward = 0
        self.q_table = world


    # Update the total reward by the reward
    def updateReward(self, extra_reward):
        # This will either increase or decrease the total reward for the episode
        x = (self.goal[0] - self.current_position[0]) **2
        y = (self.goal[1] - self.current_position[1]) **2
        dist = math.sqrt(x + y)
        complete_reward = dist + extra_reward
        self.totalReward += complete_reward

    def validate_move(self):
        valid_move_set = []
        # Check for x ranges
        if map_range_min < self.current_position[0] < map_range_max:
            valid_move_set.append(LEFT)
            valid_move_set.append(RIGHT)
        elif map_range_min == self.current_position[0]:
            valid_move_set.append(RIGHT)
        else:
            valid_move_set.append(LEFT)
        # Check for Y ranges
        if map_range_min < self.current_position[1] < map_range_max:
            valid_move_set.append(UP)
            valid_move_set.append(DOWN)
        elif map_range_min == self.current_position[1]:
            valid_move_set.append(DOWN)
        else:
            valid_move_set.append(UP)
        return valid_move_set

    # Make the agent move
    def move_right(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        x += 1
        y = self.current_position[1]
        return (x, y)
    def move_left(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        x -= 1
        y = self.current_position[1]
        return (x, y)
    def move_down(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y += 1
        return (x, y)
    def move_up(self):
        self.last_position = self.current_position
        x = self.current_position[0]
        y = self.current_position[1]
        y -= 1
        return (x, y)

    def move_agent(self):
        move_set = self.validate_move()
        randChoice = random.randint(0, len(move_set)-1)
        move = move_set[randChoice]
        if move == UP:
            return self.move_up()
        elif move == DOWN:
            return self.move_down()
        elif move == RIGHT:
            return self.move_right()
        else:
            return self.move_left()

    # Update the rewards
    # Return False to end the episode (the goal was reached)
    def checkPosition(self):
        if self.current_position == self.goal:
            print("Found Goal")
            self.updateReward(10)
            return False
        else:
            # Choose a new direction
            self.current_position = self.move_agent()
            self.visited_positions.append(self.current_position)
            # Currently get nothing for not reaching the goal
            self.updateReward(0)
            return True


gus = Agent((0, 0), goal, world)
play = gus.checkPosition()
while play:
    play = gus.checkPosition()

print(gus.totalReward)
MNM
  • Q is normally a function of state and action whereas here it is mapped one to one with the states only. I'd recommend that you have a mapping from 1D state representation to your xD state representation so that Q always only has 2 dimensions. – James Brusey Jun 21 '19 at 06:48
  • So like flatten the world (5x5) into a 1D array of length 25? – MNM Jun 21 '19 at 07:04
  • Yes - and then you need another dimension for actions, i.e. Q(s,a). – James Brusey Jun 21 '19 at 07:44
  • q_table = np.zeros((2,25)) – MNM Jun 21 '19 at 08:01
  • def __init__(self, current_position, my_goal, q_table): – MNM Jun 21 '19 at 08:02
  • self.q_table = q_table – MNM Jun 21 '19 at 08:02
  • so kind of like this? – MNM Jun 21 '19 at 08:02
  • Roughly speaking - yes. I'm not sure which algorithm you are trying to implement. Your code does not seem to update or use Q. A good starting point will be to try something like Monte Carlo with Exploring Starts. This means that you want to choose actions randomly for the first step and then greedily with respect to Q from then on. Greedy means that you find the action that has the largest value for the current state. So you need to do a slice through your Q array (by the current state) and then find the argmax. Also see https://gym.openai.com/ and https://github.com/openai/baselines – James Brusey Jun 21 '19 at 09:02
  • It's just occurred to me that you might not be able to solve this problem using RL. You have an unknown goal location that keeps changing. The issue is that your state representation is not Markovian. One way to fix this is to have, as part of the state, the relative location of the goal. – James Brusey Jun 23 '19 at 18:21
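
To make the indexing described in these comments concrete, here is a minimal sketch (the helper names state_index and greedy_action are illustrative, not from the original code):

import numpy as np

NUM_ACTIONS = 4                      # LEFT, RIGHT, UP, DOWN
GRID_SIZE = 5

# Q(s, a): one row per flattened state, one column per action
q_table = np.zeros((GRID_SIZE * GRID_SIZE, NUM_ACTIONS))

def state_index(position):
    # Map an (x, y) grid position to a single integer state in [0, 24]
    x, y = position
    return x * GRID_SIZE + y

def greedy_action(q_table, position):
    # Slice Q by the current state, then take the argmax over actions
    return int(np.argmax(q_table[state_index(position)]))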

1 Answer


I have a few suggestions based on your code example:

  1. Separate the environment from the agent. The environment needs a method of the form new_state, reward = env.step(old_state, action). This method says how an action transforms your old state into a new state. It's a good idea to encode your states and actions as simple integers. I strongly recommend setting up unit tests for this method; a rough sketch of this shape follows the list.

  2. The agent then needs to have an equivalent method action = agent.policy(state, reward). As a first pass, you should hand-code an agent that does what you think is right; e.g., it might just try to head towards the goal location (see the sketch after this list).

  3. Consider whether the state representation is Markovian. If you could do better at the problem by remembering all of the past states you visited, then the state doesn't have the Markov property (for instance, if the goal location keeps changing and is not part of the state, squares that look identical to the agent are not really identical). Preferably, the state representation should be compact, i.e. the smallest set that is still Markovian.

  4. Once this structure is set up, you can then think about actually learning a Q table. One possible method (easy to understand but not necessarily efficient) is Monte Carlo with either exploring starts or an epsilon-soft greedy policy. A good RL book should give pseudocode for either variant; a sketch of the epsilon-greedy variant appears after this list.
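
As a rough illustration of the step shape in point 1 (the class name GridEnv, the wall-clamping behaviour, and the reward values are assumptions made for this sketch, not part of the original answer):

GRID_SIZE = 5
LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3

class GridEnv:
    # Deterministic 5x5 grid world with the new_state, reward = step(old_state, action) shape
    def __init__(self, goal):
        self.goal = goal                                # (x, y) tuple

    def step(self, old_state, action):
        x, y = old_state
        if action == LEFT:
            x = max(x - 1, 0)                           # moves into a wall leave the position unchanged
        elif action == RIGHT:
            x = min(x + 1, GRID_SIZE - 1)
        elif action == UP:
            y = max(y - 1, 0)
        elif action == DOWN:
            y = min(y + 1, GRID_SIZE - 1)
        new_state = (x, y)
        reward = 10 if new_state == self.goal else 0    # placeholder reward scheme
        return new_state, reward

# Unit tests of the kind point 1 recommends
env = GridEnv(goal=(3, 2))
assert env.step((0, 0), RIGHT) == ((1, 0), 0)
assert env.step((0, 0), LEFT) == ((0, 0), 0)            # the wall clamps the move
assert env.step((2, 2), RIGHT) == ((3, 2), 10)          # stepping onto the goal pays the reward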

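A minimal sketch of points 2 and 4, assuming the flattened-state Q table from the comments above and an environment object with the step method sketched for point 1; the epsilon value, discount factor, reward-based terminal check, and episode cap are placeholders, not prescriptions. manual_policy is the hand-coded first pass from point 2, the rest is the epsilon-greedy Monte Carlo variant from point 4:

import random
import numpy as np

GRID_SIZE = 5
NUM_ACTIONS = 4
LEFT, RIGHT, UP, DOWN = 0, 1, 2, 3

def state_index(position):
    # Map an (x, y) position to a single integer state in [0, 24]
    x, y = position
    return x * GRID_SIZE + y

def manual_policy(state, goal):
    # Point 2: a hand-coded first pass that simply heads towards the goal
    x, y = state
    gx, gy = goal
    if x < gx:
        return RIGHT
    if x > gx:
        return LEFT
    if y < gy:
        return DOWN
    return UP

def epsilon_greedy(q_table, state, epsilon=0.1):
    # Point 4: explore with probability epsilon, otherwise act greedily on Q
    if random.random() < epsilon:
        return random.randrange(NUM_ACTIONS)
    return int(np.argmax(q_table[state_index(state)]))

def run_episode(env, q_table, start=(0, 0), epsilon=0.1, max_steps=200):
    # Collect (state, action, reward) triples for one episode
    trajectory = []
    state = start
    for _ in range(max_steps):
        action = epsilon_greedy(q_table, state, epsilon)
        new_state, reward = env.step(state, action)
        trajectory.append((state, action, reward))
        state = new_state
        if reward > 0:                                  # placeholder terminal check: goal reached
            break
    return trajectory

def monte_carlo_update(q_table, returns_count, trajectory, gamma=0.9):
    # Every-visit Monte Carlo: average the discounted return observed after each (s, a)
    g = 0.0
    for state, action, reward in reversed(trajectory):
        g = reward + gamma * g
        s = state_index(state)
        returns_count[s, action] += 1
        q_table[s, action] += (g - q_table[s, action]) / returns_count[s, action]

Training then just loops over episodes, for example:

q_table = np.zeros((GRID_SIZE * GRID_SIZE, NUM_ACTIONS))
returns_count = np.zeros_like(q_table)
for episode in range(500):
    trajectory = run_episode(env, q_table)              # env as sketched for point 1
    monte_carlo_update(q_table, returns_count, trajectory)
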
When you are feeling confident, head to OpenAI Gym https://www.gymlibrary.dev/ for some more detailed class structures. There are some hints about creating your own environments here: https://www.gymlibrary.dev/content/environment_creation/

James Brusey
  • I'm starting to think you're right; I need to rework this whole project from the ground up. Thank you for your advice. – MNM Jun 25 '19 at 00:07