I read about Tesauro's TD-Gammon program and would love to implement it for tic tac toe, but almost all of the information is inaccessible to me as a high school student because I don't know the terminology.
The first equation at http://www.stanford.edu/group/pdplab/pdphandbook/handbookch10.html#x26-1310009.2 gives the "general supervised learning paradigm." It says that the w sub t on the left side of the equation is the parameter vector at time step t. What exactly does "time step" mean? Within the framework of a tic tac toe neural network designed to output the value of a board state, would the time step refer to the number of pieces played so far in a given game? For example, would the board represented by the string "xoxoxoxox" be at time step 9 and the board "xoxoxoxo " at time step 8? Or would the time step refer to the amount of time elapsed since training began?
Since w sub t is the weight vector for a given time step, does this mean that every time step has its own evaluation function (neural network)? So to evaluate a board state with only one move played, you would have to feed it into a different NN than the one used for a board state with two moves played? I think I am misinterpreting something here, because as far as I know Tesauro used only one NN for evaluating all board states (though it is hard to find reliable information about TD-Gammon).
How come the gradient of the output is taken with respect to w and not w sub t?
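To make the question concrete, here is a minimal sketch of how I currently read the update rule. It is only my guess, not anything taken from the handbook or from TD-Gammon. I am assuming a linear evaluator V(s) = w . x(s) in place of a real neural network, a supervised target z equal to the final outcome of the game (+1 if x wins), and a step size alpha. The names encode, value, and supervised_step are ones I made up for illustration. In this reading there is a single weight vector, t simply counts how many updates have been applied to it, and the gradient is taken with respect to w and then evaluated at the current weights w sub t.

```python
import numpy as np

def encode(board):
    """Map a 9-character board string to features: x -> +1, o -> -1, blank -> 0."""
    return np.array([{"x": 1.0, "o": -1.0}.get(c, 0.0) for c in board])

def value(w, board):
    """Linear evaluator V(s) = w . x(s); a stand-in for a neural network."""
    return float(w @ encode(board))

def supervised_step(w_t, board, z, alpha=0.1):
    """One update (my assumed form): w_{t+1} = w_t + alpha * (z - V(s_t)) * grad_w V(s_t)."""
    x = encode(board)
    grad = x  # for a linear V, the gradient with respect to w is just x(s)
    return w_t + alpha * (z - float(w_t @ x)) * grad

# Build the positions of one made-up game in which x wins down the left column.
moves = [(0, "x"), (1, "o"), (3, "x"), (2, "o"), (6, "x")]
cells = [" "] * 9
boards = []
for index, player in moves:
    cells[index] = player
    boards.append("".join(cells))

w = np.zeros(9)  # the one and only weight vector
z = 1.0          # final outcome of the game, used as the target for every position
t = 0            # counts updates applied to w, under my reading of "time step"
for board in boards:
    w = supervised_step(w, board, z)  # the same evaluator is reused for every position
    t += 1

print(t, value(w, boards[-1]))
```

If this reading is right, then w sub t and w sub t+1 are the same network before and after one update, not two separate networks. Is that the correct interpretation?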
Thanks in advance for clarifying these ideas. I would appreciate any advice regarding my project or suggestions for accessible reading material.