I have an artificial neural network that plays Tic-Tac-Toe, but it is not complete yet.
What I have so far:
- the reward array "R[t]" with an integer value for every timestep or move "t" (1 = player A wins, 0 = draw, -1 = player B wins)
- the input values are correctly propagated through the network
- the formula for adjusting the weights (sketched below)
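I believe this is the standard TD(λ) eligibility-trace update; please correct me if that assumption is wrong. Here V is the network's value output for a state s, w a weight, α the learning rate, γ the discount factor, λ the trace decay and e_t the eligibility trace:

```latex
\delta_t = r_{t+1} + \gamma\,V(s_{t+1}) - V(s_t), \qquad
e_t = \gamma\lambda\,e_{t-1} + \nabla_w V(s_t), \qquad
\Delta w_t = \alpha\,\delta_t\,e_t
```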
What is missing:
- the TD learning: I still need a procedure that "backpropagates" the network's errors using the TD(λ) algorithm.
But I don't really understand this algorithm.
My approach so far ...
The trace-decay parameter λ should be "0.1", since states far from the end of the game should not receive as much of the reward.
The learning rate is "0.5" for both weight layers (input→hidden and hidden→output).
It's a case of delayed reward: the reward remains "0" until the game ends; then it becomes "1" if the first player wins, "-1" if the second player wins, or "0" in case of a draw.
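To make this concrete, here is a minimal sketch of how I currently imagine one game's worth of TD(λ) updates. `V`, `grad_V` and `w` are hypothetical placeholders for my actual network's value output, its gradient and its weights, and I use γ = 1 since the game is short and episodic:

```python
import numpy as np

# Hypothetical placeholders for my real network (the names are mine):
#   V(s, w)      -> scalar value estimate for board state s
#   grad_V(s, w) -> gradient of V(s, w) w.r.t. the weights (same shape as w)

ALPHA  = 0.5   # learning rate (same for both layers)
LAMBDA = 0.1   # trace-decay parameter
GAMMA  = 1.0   # no discounting: the game is short and episodic

def td_lambda_episode(states, final_reward, w, V, grad_V):
    """Update weights w for one finished game.
    states: sequence of board states s_0 .. s_T
    final_reward: +1 (player A wins), 0 (draw), -1 (player B wins)"""
    e = np.zeros_like(w)                    # one eligibility trace per weight
    for t in range(len(states) - 1):
        v_t    = V(states[t], w)
        v_next = V(states[t + 1], w)
        r = 0.0                             # delayed reward: 0 for non-final moves
        if t + 1 == len(states) - 1:        # transition into the terminal state
            r = final_reward
            v_next = 0.0                    # a terminal state has value 0
        delta = r + GAMMA * v_next - v_t    # TD error for this move
        e = GAMMA * LAMBDA * e + grad_V(states[t], w)  # decay, then accumulate
        w = w + ALPHA * delta * e           # TD(lambda) weight change
    return w
```

I would then call td_lambda_episode once per self-play game, passing the recorded board states and the final value from "R[t]" — but I am not sure this is right, hence the questions below.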
My questions:
- How and when do you calculate the net's error (TD error)? (I sketch my own guess for the terminal step after this list.)
- How can you implement the "backpropagation" of the error?
- How are the weights adjusted using TD(λ)?
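For the first question, my guess (assuming a terminal state is defined to have value 0) is that at the final move T the error reduces to

```latex
\delta_T = R_T - V(s_T)
```

and for all earlier moves it is δ_t = γ V(s_{t+1}) − V(s_t), since the reward stays "0" until the game ends. Is that right?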
Thank you so much in advance :)