I have a problem with initializing the policy parameters Theta for the REINFORCE algorithm from the book Reinforcement Learning: An Introduction, 2nd Edition, by Sutton & Barto (Chapter 13).
Here is the pseudocode for the algorithm:
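To make the question concrete, here is a minimal sketch of how I understand the per-step update for a two-action softmax policy (in Lua, like my implementation; the function names and the episode table format are simplified for this post and are not the exact code linked at the bottom):

```lua
-- Softmax policy over two action preferences: Theta[1] (LEFT), Theta[2] (RIGHT).
local function softmax(theta)
  local e1, e2 = math.exp(theta[1]), math.exp(theta[2])
  local z = e1 + e2
  return { e1 / z, e2 / z }          -- { P(LEFT), P(RIGHT) }
end

-- One REINFORCE update sweep over a finished episode.
-- episode is a list of { action = 1 or 2, reward = r } entries.
-- For this softmax, d/dTheta[i] ln pi(a) = (1 if i == a else 0) - pi[i].
local function reinforceUpdate(theta, episode, alpha, gamma)
  for t = 1, #episode do
    -- G = (discounted) return from step t onward
    local G, discount = 0, 1
    for k = t, #episode do
      G = G + discount * episode[k].reward
      discount = discount * gamma
    end
    local pi = softmax(theta)
    local a = episode[t].action
    for i = 1, 2 do
      local grad = (i == a and 1 or 0) - pi[i]
      theta[i] = theta[i] + alpha * gamma ^ (t - 1) * G * grad
    end
  end
end
```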
They say you can initialize Theta (the policy parameters) with arbitrary values, e.g. 0. But then I get completely wrong results:
To fix this I had to initialize them so that they correspond to an epsilon-greedy policy. This can be done by noting that, with action preferences and the softmax policy:
Probability(LEFT) = e^Theta[1] / (e^Theta[1] + e^Theta[2])
Probability(RIGHT) = e^Theta[2] / (e^Theta[1] + e^Theta[2])
Solving, for example, the second one with Probability(RIGHT) = p (using p = Epsilon / 2) and the fact that the two probabilities sum to 1 gives e^Theta[1] / e^Theta[2] = (1 - p) / p, i.e.:
Theta[1] = ln((1 - p) / p) + Theta[2]
So giving Theta[2] any value and using the offset above to calculate Theta[1] produces the results from the book (sorry for reversing the alphas in the first picture):
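For example, with Epsilon = 0.1 (so p = 0.05) and Theta[2] = 0, the initialization works out like this (the numbers and variable names are just for illustration):

```lua
local epsilon = 0.1
local p = epsilon / 2                          -- target P(RIGHT)

local theta = {}
theta[2] = 0                                   -- arbitrary
theta[1] = math.log((1 - p) / p) + theta[2]    -- offset from above, ~2.944

-- check: the softmax gives back P(RIGHT) = p
local e1, e2 = math.exp(theta[1]), math.exp(theta[2])
print(e2 / (e1 + e2))                          -- prints ~0.05
```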
So I have no idea why the initial values should have to mimic an epsilon-greedy policy, or what goes wrong when they are set to 0 as the book suggests.
The other problem is that when I compare REINFORCE runs with different alphas, their ordering is exactly the opposite of what Figure 13.1 in the book shows. Alpha = 2^-14 was supposed to outperform the others, but instead it is the worst.
Any help and clarification will be appreciated.
I have this implemented in Lua here: Lua code
Thank you