I am trying to implement an agent that uses Q-learning to play Ludo. I trained it with an ε-greedy action selector with an epsilon of 0.1, a learning rate of 0.6, and a discount factor of 0.8.
I ran the game for around 50K steps and haven't won a single game. This is puzzling, because the Q-table seems to have converged to roughly what I expected. Why am I losing so much to random players? Shouldn't the agent be able to win if the Q-table isn't changing much anymore? And in general, how many iterations would I need to train my agent?
I am not sure how much information is needed, so I will update the post with more details if necessary.
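For reference, the update rule I am using is the standard Q-learning one, with the hyperparameters stated above (the function and variable names here are just mine for illustration):

```python
import numpy as np

ALPHA = 0.6   # learning rate, as in my setup
GAMMA = 0.8   # discount factor, as in my setup

def q_update(Q, state, action, reward, next_state):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])
    return Q
```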
Possible states, represented as rows in the Q-table:
- In home
- On globe
- On a star
- In goal
- On the Winner Road
- In safety with same colored player
- On free space
Possible actions, represented as columns for each state:
- Move out from home
- Get in goal
- Move to globe
- Move to star
- Move to goal via star
- Get into safety with a same-colored token
- Get into the Winner Road
- Suicide if the opponent is on a globe
- Kill opponent
- Just move
- No move possible
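To be concrete about the action selector, my ε-greedy selection over this state/action table looks roughly like the sketch below. The masking to the actions that are actually legal on a given turn is my assumption about how a Ludo turn has to be handled, since most of the 11 actions are usually unavailable:

```python
import numpy as np

EPSILON = 0.1  # exploration rate, as in my setup

def epsilon_greedy(Q, state, valid_actions, rng=np.random.default_rng()):
    """With probability epsilon pick a random valid action,
    otherwise pick the valid action with the highest Q-value."""
    if rng.random() < EPSILON:
        return int(rng.choice(valid_actions))
    # restrict the argmax to the actions available this turn
    q_vals = Q[state, valid_actions]
    return int(valid_actions[np.argmax(q_vals)])
```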
I start by initializing my Q-table with random values, and end up with a table that looks like this after 5000 iterations:
```
 -21.9241  345.35   169.189  462.934  308.445  842.939  256.074  712.23   283.328  137.078   -32.8
 398.895   968.8    574.977  488.216  468.481  948.541  904.77   159.578  237.928   29.7712  417.599
1314.25    756.426  333.321  589.25   616.682  583.632  481.84   457.585  683.22   329.132   227.329
1127.58   1457.92  1365.58  1429.26  1482.69  1574.66  1434.77  1195.64  1231.01  1232.07   1068
 807.592  1070.17   544.13  1385.63   883.123 1662.97   524.08   966.205 1649.67   509.825   909.006
 225.453  1141.34   536.544  242.647 1522.26  1484.47   297.704  993.186  589.984  689.73   1340.89
1295.03    310.461  361.776  399.866  663.152  334.657  497.956  229.94   294.462  311.505  1428.26
```
My immediate reward is based on how far each token has advanced, multiplied by a constant of 10, and is computed after an action has been performed. The home position is encoded as -1 and the goal position as 99; all positions in between are encoded as 0-55. For each token that is in the goal, an extra reward of +100 is added to the immediate reward.
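In code, my reading of that reward scheme is roughly the following (the summing over all four tokens is my interpretation of "how far each token is", and the position encoding follows the description above):

```python
def immediate_reward(positions):
    """Reward after an action, per the scheme described above:
    10 * position for each token, plus +100 per token in the goal.
    Encoding: -1 = home, 0-55 = on the board, 99 = goal."""
    GOAL = 99
    reward = sum(10 * p for p in positions)
    reward += 100 * sum(1 for p in positions if p == GOAL)
    return reward
```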
Usually, my player moves exactly one token to the goal... and that's it.