I'm trying to implement a policy network agent for the game 2048, following Karpathy's RL tutorial. I understand that the algorithm needs to play a batch of games, remember the inputs and the actions taken, and then mean-center and normalize the final scores. However, I got stuck on the design of the loss function. How do I correctly encourage actions that lead to better final scores and discourage those that lead to worse ones?
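For concreteness, the score-processing step I have in mind looks roughly like this (just a sketch; `normalize_returns` and `final_scores` are placeholder names, not from my actual code):

    import torch

    def normalize_returns(final_scores):
        # Mean-center the final scores of the batch of games...
        returns = final_scores - final_scores.mean()
        # ...and scale to unit variance, guarding against zero std.
        std = returns.std()
        return returns / std if std > 0 else returns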
Using softmax at the output layer, I devised something along these lines:
    loss = sum((action - net_output) * reward)
where action is in one-hot format. However, this loss doesn't seem to do much: the network doesn't learn. My full code (without the game environment) in PyTorch is here.
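To make the question self-contained, here is a minimal sketch of the loss I described above (the tensor names and shapes are illustrative; my real code only adds bookkeeping around this):

    import torch
    import torch.nn.functional as F

    # net_output: softmax probabilities over the 4 moves, shape (batch, 4)
    # actions:    indices of the moves actually taken, shape (batch,)
    # rewards:    mean-centered, normalized final scores, shape (batch,)
    def my_loss(net_output, actions, rewards):
        action_one_hot = F.one_hot(actions, num_classes=4).float()
        # The loss from the question: weight the (one-hot - probability)
        # difference by the reward and sum over the batch.
        return ((action_one_hot - net_output) * rewards.unsqueeze(1)).sum()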