
I am attempting to implement the algorithm from the TD-Gammon article by Gerald Tesauro. The core of the learning algorithm is described in the following paragraph:

[Excerpt from Tesauro's article, shown as an image in the post. Its central weight-update rule is

$$w_{t+1} - w_t = \alpha \, (Y_{t+1} - Y_t) \sum_{k=1}^{t} \lambda^{t-k} \, \nabla_w Y_k$$

where $Y_t$ is the network's output at time step $t$, $\alpha$ is the learning rate, and $\lambda$ controls how strongly the gradients of earlier outputs contribute.]
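To spell out one instance of the sum (not from the article, just expanding the formula): at $t = 3$ the weight change is $\alpha \, (Y_4 - Y_3)\,(\lambda^2 \nabla_w Y_1 + \lambda \nabla_w Y_2 + \nabla_w Y_3)$, so every output produced so far contributes its gradient, discounted by how long ago it was produced.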

I have decided to use a single hidden layer (if that was enough to play world-class backgammon in the early 1990s, then it's enough for me). I am pretty certain that everything except the train() function is correct, since those parts are easier to test, but I have no idea whether I have implemented the final training algorithm correctly.

import numpy as np

class TD_network:
    """
    Neural network with a single hidden layer and the temporal difference (TD) training
    algorithm from G. Tesauro's 1995 TD-Gammon article.
    """
    def __init__(self, num_input, num_hidden, num_output, hnorm, dhnorm, onorm, donorm):
        self.w21 = 2*np.random.rand(num_hidden, num_input) - 1
        self.w32 = 2*np.random.rand(num_output, num_hidden) - 1
        self.b2 = 2*np.random.rand(num_hidden) - 1
        self.b3 = 2*np.random.rand(num_output) - 1
        self.hnorm = hnorm
        self.dhnorm = dhnorm
        self.onorm = onorm
        self.donorm = donorm

    def value(self, input):
        """Evaluates the NN output"""
        assert(input.shape == self.w21[1,:].shape)
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3
        return(self.onorm(o))

    def gradient(self, input):
        """
        Calculates the gradient of the NN at the given input. Outputs a list of dictionaries
        where each dict corresponds to the gradient of an output node, and each element in
        a given dict gives the gradient for a subset of the weights. 
        """ 
        assert(input.shape == self.w21[1,:].shape)
        J = []
        h = self.w21.dot(input) + self.b2
        hn = self.hnorm(h)
        o = self.w32.dot(hn) + self.b3

        for i in range(len(self.b3)):
            db3 = np.zeros(self.b3.shape)
            db3[i] = self.donorm(o[i])

            dw32 = np.zeros(self.w32.shape)
            dw32[i, :] = self.donorm(o[i])*hn

            db2 = np.multiply(self.dhnorm(h), self.w32[i,:])*self.donorm(o[i])
            dw21 = np.transpose(np.outer(input, db2))

            J.append(dict(db3 = db3, dw32 = dw32, db2 = db2, dw21 = dw21))
        return(J)

    def train(self, input_states, end_result, a = 0.1, l = 0.7):
        """
        Trains the network using a single series of input states representing a game from beginning
        to end, and a final (supervised / desired) output for the end state
        """
        outputs = [self.value(input_state) for input_state in input_states]
        outputs.append(end_result)
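        # outputs[t] is the network's estimate for input_states[t]; appending
        # end_result means the final temporal difference compares the last
        # estimate with the true game outcome.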
        for t in range(len(input_states)):
            delta = dict(
                db3 = np.zeros(self.b3.shape),
                dw32 = np.zeros(self.w32.shape),
                db2 = np.zeros(self.b2.shape),
                dw21 = np.zeros(self.w21.shape))
            grad = self.gradient(input_states[t])
            for i in range(len(self.b3)):
                for key in delta.keys():
                    td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])
                    delta[key] += a*(outputs[t + 1][i] - outputs[t][i])*td_sum
            self.w21 += delta["dw21"]
            self.w32 += delta["dw32"]
            self.b2 += delta["db2"]
            self.b3 += delta["db3"]
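
The hnorm, dhnorm, onorm and donorm arguments are the activation functions of the hidden and output layers and their derivatives, all evaluated on the pre-activations. For concreteness, a plain sigmoid pair (one of the variants I tried) and a tic-tac-toe-sized network would look something like this; the layer sizes and board encoding are just illustrative:

def sigmoid(x):
    return 1.0/(1.0 + np.exp(-x))

def dsigmoid(x):
    # derivative of the sigmoid, as a function of the pre-activation x
    s = sigmoid(x)
    return s*(1.0 - s)

# 9 inputs for a 3x3 board encoding (illustrative), 12 hidden nodes,
# 1 output estimating the expected result of the game
net = TD_network(9, 12, 1, sigmoid, dsigmoid, sigmoid, dsigmoid)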

The way I use this is I play through a whole game (or rather, the neural net plays against itself), and then I send the states of that game, from start to finish, into train(), along with the final result. It then takes this game log, and applies the above formula to alter weights using the first game state, then the first and second game states, and so on until the final time, when it uses the entire list of game states. Then I repeat that many times and hope that the network learns.
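Roughly, the outer training loop (with net constructed as above, and play_self_play_game() as a stand-in for the tic-tac-toe self-play code, which I haven't included) then looks like:

for game in range(200000):
    # let the current network play one full game against itself;
    # states is the list of board encodings visited during the game,
    # result is an array of the same shape as the network output
    states, result = play_self_play_game(net)
    # update the weights from that single game, as described above
    net.train(states, result, a=0.1, l=0.7)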

To be clear, I am not after feedback on my code writing. This was never meant to be more than a quick and dirty implementation to see that I have all the nuts and bolts in the right spots.

However, I have no idea whether it is correct, as I have thus far been unable to make it play tic-tac-toe at any reasonable level. There could be many reasons for that. Maybe I'm not giving it enough hidden nodes (I have used 10 to 12). Maybe it needs more games to train on (I have used 200 000). Maybe it would do better with different activation ("normalisation") functions (I've tried sigmoid and ReLU, leaky and non-leaky, in different variations). Maybe the learning parameters are not tuned right. Maybe tic-tac-toe and its deterministic gameplay mean it "locks in" on certain paths in the game tree. Or maybe the training implementation is just wrong. Which is why I'm here.

Have I misunderstood Tesauro's algorithm?

Arthur
  • Hi Arthur, I don't have an answer, but I remembered a few words from Rich Sutton to put the difficulty of the problem into context: "The primary reason for the failure is that backpropagation is fairly tricky to use effectively, doubly so in an online application like reinforcement learning. It is true that Tesauro used this approach in his strikingly successful backgammon application, but note that at the time of his work with TDgammon, Tesauro was already an expert in applying backprop networks to backgammon. [...]" – Pablo EM Nov 26 '19 at 18:49
  • "[...] He had already built the world's best computer player of backgammon using backprop networks. He had already learned all the tricks and tweaks and parameter settings to make backprop networks learn well. Unless you have a similarly extensive background of experience, you are likely to be very frustrated using a backprop network in reinforcement learning." (http://www.incompleteideas.net/RL-FAQ.html#backpropagation) That quote is from several years ago, and I'm not sure how relevant it is today, but I think people tend to underestimate the difficulty of combining RL and backprop. – Pablo EM Nov 26 '19 at 18:50
  • @PabloEM I'm beginning to believe you. Didn't think it was this difficult. I don't know what's simpler though. A genetic algorithm, perhaps? Oh well. – Arthur Nov 27 '19 at 17:28

1 Answer


I can't say that I entirely understand your implementation, but this line jumps out to me:

                    td_sum = sum([l**(t-k)*grad[i][key] for k in range(t + 1)])

Comparing with the formula you reference, I see at least two differences:

  • Your implementation sums over t+1 elements compared to t elements in the formula
  • The gradient should be indexed with the same k as used in l**(t-k), but in your implementation it is indexed with i and key, without any reference to k

Perhaps if you fix these discrepancies your solution will behave more as expected.
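As a rough, untested sketch of what I mean, the body of train() would become something like the following. The key change is that the gradient is evaluated at state k inside the sum; whether that sum should run over range(t) or range(t + 1) depends on how you map the article's 1-indexed time steps onto your 0-indexed lists.

grads = [self.gradient(s) for s in input_states]  # gradient at every visited state
for t in range(len(input_states)):
    delta = dict(
        db3 = np.zeros(self.b3.shape),
        dw32 = np.zeros(self.w32.shape),
        db2 = np.zeros(self.b2.shape),
        dw21 = np.zeros(self.w21.shape))
    for i in range(len(self.b3)):
        for key in delta.keys():
            # lambda^(t-k) times the gradient of output i taken at state k
            td_sum = sum(l**(t - k)*grads[k][i][key] for k in range(t + 1))
            delta[key] += a*(outputs[t + 1][i] - outputs[t][i])*td_sum
    self.w21 += delta["dw21"]
    self.w32 += delta["dw32"]
    self.b2 += delta["db2"]
    self.b3 += delta["db3"]

Precomputing the gradients once per state also avoids recomputing them inside the sum for every t.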

Seb
  • I completely agree with you. I will have to move the computation of `grad` so that we get `self.gradient(input_states[k])` in there instead. And I think I put `t+1` there because the sum in the article runs to `k = t`. So now instead I have `td_sum = sum([l**(t-k - 1)*self.gradient(input_states[k])[i][key] for k in range(t)])`, although that doesn't really work either. But I still don't know whether it's the implementation that's wrong, or if I am just using the wrong "settings". – Arthur Nov 27 '19 at 17:29
  • I decided to give this the bounty because while my issue isn't entirely resolved, the answer was helpful. Also, the bounty wasn't formally posted to get a solution (although that was the ultimate goal), it was posted to get more attention. That attention brought this answer, so that's where the bounty will go. – Arthur Dec 02 '19 at 12:14
  • Many thanks, appreciate it. I wish I could have been of more help! Have you made any progress in the meantime? – Seb Dec 02 '19 at 13:01
  • No, I haven't. Mostly because I haven't had much time to work on it. But after reading the comments above by Pablo, I have more or less decided that I will give up on this approach, and rather go with something easier, like a genetic algorithm. Just toss a bunch of neural nets at the game and keep the ones that perform well. Seems like there is less fine-tuning that can go wrong. – Arthur Dec 02 '19 at 13:27