
I'm setting up an R-table with (255 states, 4 actions). How do I build it from an R-table of shape (15, 15)?

I have created an R-table of shape (15, 15), but it turns out I have to make an R-table of shape (225, 4) for the homework.

import numpy as np

r_matrix = np.array([
[-1, -2, -3, -2, -3, -3, -4, -1, -4, -2, -1, -2, -3, -3, 500],
[-1, -3, -1, -2, -4, -1, -4, -1, -4, -2, -4, -2, -2, -2, -1],
[-4, -2, -1, -4, -2, -1, -2, -4, -2, -3, -2, -1, -2, -4, -4],
[-4, -2, -4, -1, -3, -2, -3, -2, -4, -2, -4, -1, -2, -4, -2],
[-4, -2, -2, -3, -2, -3, -1, -1, -4, -2, -1, -3, -4, -2, -4],
[-4, -3, -3, -4, -2, -3, -4, -2, -2, -1, -1, -2, -1, -2, -1],
[-2, -3, -2, -1, -1, -3, -2, -1, -4, -3, -1, -1, -2, -3, -3],
[-3, -1, -1, -4, -4, -3, -1, -2, -3, -1, -1, -4, -4, -3, -3],
[-3, -1, -4, -2, -3, -3, -1, -4, -4, -4, -2, -2, -2, -2, -1],
[-3, -4, -4, -2, -3, -4, -3, -3, -2, -2, -3, -4, -3, -4, -1],
[-3, -4, -1, -1, -1, -4, -4, -4, -4, -1, -2, -4, -2, -2, -1],
[-1, -3, -3, -3, -3, -3, -3, -3, -4, -1, -2, -4, -1, -2, -4],
[-2, -2, -1, -2, -2, -2, -4, -3, -1, -4, -1, -4, -2, -2, -2],
[-2, -1, -3, -1, -4, -4, -1, -3, -3, -1, -1, -2, -3, -4, -3],
[-2, -2, -1, -4, -4, -4, -2, -2, -3, -1, -2, -2, -1, -1, -3]
])

# Desired result: one row per state, actions ordered (Up, Right, Down, Left)
r_matrix2 = np.array([
    [None, -2, -1, None],
    [None, -3, -3, -1],
    [None, -2, -1, -2],
    [None, -3, -2, -3],
    [None, -3, -4, -2],
    ...
])

Thank you

  • How do you intend to turn 225 elements `(15, 15)` into 900 `(225, 4)`? – gmds Apr 27 '19 at 23:36
  • Sorry that I wasn't clear. Basically, the teacher asked us to create an R-table for Q-Learning. He gives us 225 elements (15, 15). The R-table will be (255 states, 4 actions). – Try Apr 27 '19 at 23:39
  • Okay, so where do the actions come from? – gmds Apr 27 '19 at 23:41
  • Up, Right, Down, Left from the current state. Basically if I was on state (0, 0), it will be [None, -2, -1, None] – Try Apr 27 '19 at 23:42
  • Each value corresponds to a reward value for an action taken from a state, right? Where do the *values* come from? – gmds Apr 27 '19 at 23:43

1 Answer


To learn a policy that maximizes reward, you apparently want to do Reward Backpropagation (or Value Iteration) over 225 location vertices, each with (symmetric) in-degree 4.

(BTW, you twice mentioned 255 where I think you meant 225.)

Arbitrarily define bad as -1000; substitute None values with that "negative infinity" reward.

There is a modeling detail at the Goal node containing the reward of 500: ensure that all four of its outward edges have the bad reward, so an agent won't be tempted to leave the Goal and follow a cycle that permits unending collection of the 500 reward.
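Concretely, the (225, 4) table from the question could be filled in like this. The state numbering s = 15*row + col, the (Up, Right, Down, Left) action order, and the convention that an action's reward is the r_matrix entry of the cell you land on are taken from the question's [None, -2, -1, None] example; the names r_table, moves and goal, and the use of bad for both off-grid moves and the Goal's outward edges, are just this sketch's choices.

import numpy as np

GRID = 15
bad = -1000                      # stands in for "negative infinity" / the question's None
goal = (0, 14)                   # the cell holding the 500 reward

# Action order matches the question: Up, Right, Down, Left.
moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]

r_table = np.full((GRID * GRID, 4), bad, dtype=float)
for row in range(GRID):
    for col in range(GRID):
        state = row * GRID + col                 # flatten (row, col) -> 0..224
        for a, (dr, dc) in enumerate(moves):
            nr, nc = row + dr, col + dc
            if 0 <= nr < GRID and 0 <= nc < GRID:
                # reward of an action = r_matrix value of the cell you land on
                r_table[state, a] = r_matrix[nr, nc]

# Make every move *out of* the Goal unattractive, so the agent cannot
# bounce off it and collect the 500 again and again.
r_table[goal[0] * GRID + goal[1], :] = bad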

diameter

Compute the diameter of your grid-world graph. By inspection it is trivially the Manhattan distance between opposite corners, 2 × (15 − 1) = 28. For arbitrary graphs you may need a shortest-path algorithm to determine this.
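For this grid it is a one-liner; the name diameter is reused in the iteration sketch below.

GRID = 15
diameter = 2 * (GRID - 1)   # Manhattan distance between opposite corners: 28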

init

Initialize the value of all non-Goal location vertices to bad.
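Continuing the sketch above, that might look like the following; the name value, and the choice to treat the Goal as terminal with value 0 (its 500 is collected as the reward for stepping into it via r_table), are this sketch's assumptions rather than something spelled out in the text.

value = np.full((GRID, GRID), float(bad))   # every non-Goal location starts out bad
value[goal] = 0.0                           # assumed: the Goal is terminal, so its value is 0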

iteration

For each location vertex v, remember its previous value val. Then chase each outward edge to an adjacent location, find the (typically negative) reward of traversing the edge, and store the updated value of v as max(val, value of the adjacent location + reward), evaluated across all four edges.

The interpretation is: if we land on a location having value val, we are confident that by following an optimal policy we could collect val reward points.

This concludes an iteration. Repeat, for diameter iterations. After the first one, only the locations adjacent to the Goal will have been updated, to values just shy of 500 (e.g. 497, 498, 499, depending on how the edge rewards are charged). After the second, a few more values are updated, and so on, gradually whittling away until no more bad values remain.
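Here is one way that sweep could look, reusing r_table, value, moves and diameter from the snippets above; it is a sketch of the update just described, not the only way to schedule it.

for _ in range(diameter):
    for row in range(GRID):
        for col in range(GRID):
            state = row * GRID + col
            best = value[row, col]                    # remember the previous value
            for a, (dr, dc) in enumerate(moves):
                nr, nc = row + dr, col + dc
                if 0 <= nr < GRID and 0 <= nc < GRID:
                    # reward of the edge, plus the value of the location it leads to
                    best = max(best, r_table[state, a] + value[nr, nc])
            value[row, col] = best                    # max over all four edges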

traversal

Now traversing the graph from Start to Goal is straightforward. Across all four out-edges, simply follow the edge that leads to the highest-value location, and repeat until encountering the Goal.
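A sketch of that traversal, continuing the snippets above. Because this sketch pinned the Goal's value at 0 (its 500 lives in r_table as the reward for stepping into it), "highest-value location" translates here into "highest reward-plus-value move", which is the same greedy rule. The question never says where Start is, so (14, 0) is only a placeholder.

def greedy_path(start):
    """Follow the best move from each cell until the Goal is reached."""
    path = [start]
    row, col = start
    while (row, col) != goal:
        state = row * GRID + col
        scores = []
        for a, (dr, dc) in enumerate(moves):
            nr, nc = row + dr, col + dc
            if 0 <= nr < GRID and 0 <= nc < GRID:
                scores.append((r_table[state, a] + value[nr, nc], (nr, nc)))
        _, (row, col) = max(scores)                # best edge out of this cell
        path.append((row, col))
    return path

print(greedy_path((14, 0)))                        # placeholder Start cell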

J_H