
The problem I want to solve is actually not this simple; this is a kind of toy game to help me solve the bigger problem.

So I have a 5x5 matrix with all values equal to 0:

structure = np.zeros(25).reshape(5, 5)

and the goal is for the agent to turn all the values into 1, so I have:

goal_structure = np.ones(25).reshape(5, 5)

I created a class Player with 5 actions: go left, right, up, down, or flip (turn a 0 into a 1 or a 1 into a 0). For the reward, if the agent changes a 0 into a 1, it gets a +1 reward. If it turns a 1 into a 0, it gets a negative reward (I tried many values from -1 to 0, or even -0.1). And if it just goes left, right, up, or down, it gets a reward of 0.

Because I want to feed the state to my neural net, I reshaped the state as below:

reshaped_structure = np.reshape(structure, (1, 25))

and then I append the normalized position of the agent to the end of this array (because I suppose the agent should have a sense of where it is):

reshaped_state = np.append(reshaped_structure, (np.float64(self.x/4), np.float64(self.y/4)))
state = reshaped_state

But I don't get any good results! It's just like it's random! I tried different reward functions and different optimization techniques, such as experience replay, target networks, Double DQN, and dueling, but none of them seem to work! And I guess the problem is with defining the state. Can anyone maybe help me with defining a good state?

Thanks a lot!

PS: this is my Player class, including the step function:

import numpy as np
from gym import spaces

# Module-level definitions assumed by the class below (the originals were not
# shown in the question): action codes, grid bounds, and the shared grid.
left, right, up, down, flip = 0, 1, 2, 3, 4
x_min, y_min = 0, 0
x_threshold, y_threshold = 4, 4

structure = np.zeros(25).reshape(5, 5)


class Player:

    def __init__(self):
        self.x = 0
        self.y = 0

        self.max_time_step = 50
        self.time_step = 0
        self.reward_list = []
        self.sum_reward_list = []
        self.sum_rewards = []

        self.gather_positions = []
        # self.dict = {}

        self.action_space = spaces.Discrete(5)
        self.observation_space = 27

    def get_done(self, time_step):
        # The episode ends only when the step budget is used up.
        if time_step == self.max_time_step:
            done = True
        else:
            done = False

        return done

    def flip_pixel(self):
        # Toggle the cell under the agent between 0 and 1.
        if structure[self.x][self.y] == 1:
            structure[self.x][self.y] = 0.0
        elif structure[self.x][self.y] == 0:
            structure[self.x][self.y] = 1

    def step(self, action, time_step):

        reward = 0

        # Movement actions: clamp the agent to the grid bounds.
        if action == right:
            if self.y < y_threshold:
                self.y = self.y + 1
            else:
                self.y = y_threshold

        if action == left:
            if self.y > y_min:
                self.y = self.y - 1
            else:
                self.y = y_min

        if action == up:
            if self.x > x_min:
                self.x = self.x - 1
            else:
                self.x = x_min

        if action == down:
            if self.x < x_threshold:
                self.x = self.x + 1
            else:
                self.x = x_threshold

        if action == flip:
            self.flip_pixel()
            # +1 for turning a 0 into a 1, small penalty for undoing a 1.
            if structure[self.x][self.y] == 1:
                reward = 1
            else:
                reward = -0.1

        self.reward_list.append(reward)

        done = self.get_done(time_step)

        # State: the 25 flattened cell values plus the normalized agent position.
        reshaped_structure = np.reshape(structure, (1, 25))
        reshaped_state = np.append(reshaped_structure, (np.float64(self.x / 4), np.float64(self.y / 4)))
        state = reshaped_state

        return state, reward, done

    def reset(self):
        global structure  # rebind the module-level grid rather than a local copy
        structure = np.zeros(25).reshape(5, 5)

        reset_reshaped_structure = np.reshape(structure, (1, 25))
        reset_reshaped_state = np.append(reset_reshaped_structure, (0, 0))
        state = reset_reshaped_state

        self.x = 0
        self.y = 0
        self.reward_list = []

        self.gather_positions = []
        # self.dict.clear()

        return state
1 Answer


I would encode the agent position as a matrix like this:

0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0

(where the agent is in the middle). Of course you have to flatten this too for the network. So your total state is 50 input values, 25 for the cell states, and 25 for the agent position.

When you encode the position as two floats, the network has to do extra work decoding the exact values of those floats. If you use an explicit scheme like the one above, it is very clear to the network exactly where the agent is. This is a "one-hot" encoding for position.
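As a rough sketch of what this encoding could look like (the function name and shapes here are illustrative, not taken from the question's code):

import numpy as np

def encode_state(structure, x, y):
    # 25 flattened cell values plus a 25-value one-hot map marking the agent's cell.
    position = np.zeros((5, 5))
    position[x][y] = 1.0
    return np.concatenate((structure.flatten(), position.flatten()))  # 50 inputs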

If you look at the Atari DQN papers, for example, the agent position is always explicitly encoded with a neuron for each possible position.

Note also that a very good policy for your agent is to stand still and constantly flip the same cell: it earns 0.45 reward per step on average for doing this (+1 for 0 to 1, -0.1 for 1 to 0, split over 2 steps). Assuming a perfect policy it can only make 25, but this flip/unflip policy will make about 22.5 reward over 50 steps and be very hard to unlearn. I would suggest the agent gets a -1 for flipping a 1 back to 0.
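For example, the flip branch of step() could be adjusted along these lines (a sketch of the suggested penalty, not the original code):

if action == flip:
    self.flip_pixel()
    if structure[self.x][self.y] == 1:
        reward = 1     # turned a 0 into a 1
    else:
        reward = -1    # turned a 1 back into a 0: the flip/unflip loop now earns 0 on average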


You mention that the agent is not learning. Might I suggest that you try to simplify as much as possible? First suggestion: reduce the length of the episode to 2 or 3 steps, and reduce the size of the grid to a single cell. See if the agent can learn to consistently set that cell to 1. At the same time, simplify your agent's brain as much as possible: reduce it to just a single output layer, i.e. a linear model with an activation. This should be very quick and easy to learn. If the agent does not learn this within 100 episodes, I suspect there is a bug in your RL implementation. If it works, you can start to expand the size of the grid and the size of the network.
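As an illustration of how small that sanity-check model can be (a sketch assuming PyTorch; the question does not say which framework is used):

import torch.nn as nn

# For a 1x1 grid with the one-hot position encoding, the state has 2 values
# (1 cell + 1 position) and there are 5 actions, so a single linear layer
# mapping the state to one Q-value per action is enough for the sanity check.
q_net = nn.Sequential(
    nn.Linear(2, 5),
)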

  • Thank you very much for your help! I tried what you recommended: I assigned a matrix for the position as you mentioned, flattened it, and appended it to the flattened configuration matrix. I fed it to the network, and I still didn't get any good results. It still seems random and even gets worse after many iterations. :( – hosseinoj Apr 13 '20 at 13:41
  • If the agent is failing to learn, there could be a lot of reasons for that: it could be that you aren't waiting long enough (typically it takes about a thousand games for an agent to learn even the simplest thing), your learning rate could be too high or too low, or there could be a bug in your implementation of the learning algorithm. If you like, you could post more details on what you are doing: is it SARSA or Q-learning, and what model are you using (a deep neural network? how many layers/nodes)? I have added two generic points to the main answer - tl;dr: simplify as much as possible. – dilaudid Apr 13 '20 at 14:08
  • I have tried changing my hyperparameters a bit, and for the 3x3 case it worked! My gamma was 0.9999 and I changed it to 0.8. I hope the same strategy can help me with the actual project! It was a great help! Thanks a lot! – hosseinoj Apr 13 '20 at 18:41
  • Thanks - if you are happy with it, then please accept my answer! Also, a gamma of 0.9999 is very high; normally 0.95-0.99 are the more usual values. – dilaudid Apr 13 '20 at 19:42
  • Yeah, sure, I did! Sorry, it was my first question on Stack Overflow; I didn't even know what that was, haha! – hosseinoj Apr 14 '20 at 08:37