I am trying to implement an agent that uses Q-learning to play Ludo. I trained it with an ε-greedy action selector with an epsilon of 0.1, a learning rate of 0.6, and a discount factor of 0.8.
I ran the game for around 50K steps and haven't won a single game. This is puzzling, because the Q-table seems to have converged to roughly what I expected. Why am I losing so much to random players? Shouldn't the agent be able to win if the Q-table isn't changing much anymore? And in general, how many iterations would I need to train my agent?
I am not sure how much information is needed, so I will update the post with more details if necessary.
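For reference, the update rule I am using is the standard Q-learning one, with the hyperparameters stated above (the function and variable names here are just mine for illustration):

```python
import numpy as np

ALPHA = 0.6   # learning rate, as in my setup
GAMMA = 0.8   # discount factor, as in my setup

def q_update(Q, state, action, reward, next_state):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))"""
    td_target = reward + GAMMA * np.max(Q[next_state])
    Q[state, action] += ALPHA * (td_target - Q[state, action])
    return Q
```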
Possible states, represented as rows in the Q-table:
- In home
- On globe
- On a star
- In goal
- On the Winner Road
- In safety with same colored player
- On free space
Possible actions, represented as columns for each state:
- Move out from home
- Get in goal
- Move to globe
- Move to star
- Move to goal via star
- Get into safety with a same-colored token
- Get into the Winner Road
- Suicide if the opponent is on a globe
- Kill opponent
- Just move
- No move possible
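To be concrete about the action selector, my ε-greedy selection over this state/action table looks roughly like the sketch below. The masking to the actions that are actually legal on a given turn is my assumption about how a Ludo turn has to be handled, since most of the 11 actions are usually unavailable:

```python
import numpy as np

EPSILON = 0.1  # exploration rate, as in my setup

def epsilon_greedy(Q, state, valid_actions, rng=np.random.default_rng()):
    """With probability epsilon pick a random valid action,
    otherwise pick the valid action with the highest Q-value."""
    if rng.random() < EPSILON:
        return int(rng.choice(valid_actions))
    # restrict the argmax to the actions available this turn
    q_vals = Q[state, valid_actions]
    return int(valid_actions[np.argmax(q_vals)])
```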
I start by initializing my Q-table with random values, and end up with a table that looks like this after 5000 iterations:
```
 -21.9241  345.35   169.189  462.934  308.445  842.939  256.074  712.23   283.328  137.078   -32.8
 398.895   968.8    574.977  488.216  468.481  948.541  904.77   159.578  237.928   29.7712  417.599
1314.25    756.426  333.321  589.25   616.682  583.632  481.84   457.585  683.22   329.132   227.329
1127.58   1457.92  1365.58  1429.26  1482.69  1574.66  1434.77  1195.64  1231.01  1232.07   1068
 807.592  1070.17   544.13  1385.63   883.123 1662.97   524.08   966.205 1649.67   509.825   909.006
 225.453  1141.34   536.544  242.647 1522.26  1484.47   297.704  993.186  589.984  689.73   1340.89
1295.03    310.461  361.776  399.866  663.152  334.657  497.956  229.94   294.462  311.505  1428.26
```
My immediate reward is based on how far each token has advanced, multiplied by a constant of 10, and is computed after an action has been performed. The home position is encoded as -1 and the goal position as 99; all positions in between are encoded as 0-55. For each token that is in the goal, an extra reward of +100 is added to the immediate reward.
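In code, my reading of that reward scheme is roughly the following (the summing over all four tokens is my interpretation of "how far each token is", and the position encoding follows the description above):

```python
def immediate_reward(positions):
    """Reward after an action, per the scheme described above:
    10 * position for each token, plus +100 per token in the goal.
    Encoding: -1 = home, 0-55 = on the board, 99 = goal."""
    GOAL = 99
    reward = sum(10 * p for p in positions)
    reward += 100 * sum(1 for p in positions if p == GOAL)
    return reward
```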
Usually, my player moves exactly one token to the goal... and that's it.