
I'm trying to implement a policy network agent for the game 2048, following Karpathy's RL tutorial. I know the algorithm needs to play a batch of games, remember the inputs and the actions taken, and normalize and mean-center the final scores. However, I'm stuck on the design of the loss function. How do I correctly encourage actions that lead to better final scores and discourage those that lead to worse ones?
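(To be concrete, by normalizing I mean something like this, with made-up final scores:)

import torch

final_scores = torch.tensor([1024., 512., 2048., 256.])  # one final score per game in the batch
rewards = (final_scores - final_scores.mean()) / (final_scores.std() + 1e-8)  # mean-center and scale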

When using softmax at the output layer, I devised something along these lines:

loss = sum((action - net_output) * reward)

where action is in one-hot format. However, this loss doesn't seem to do much; the network doesn't learn. My full code (without the game environment), in PyTorch, is here.


1 Answer


For the policy network in your code, I think you want something like this:

loss = -(log(action_probability) * reward)

where action_probability is the probability your network output for the action actually taken at that timestep.

For example, if your network output a 10% chance of taking that action, but it produced a reward of 10, your loss would be -(log(0.1) * 10), which is roughly 23.

But if your network already thought that was a good move and output a 90% chance of taking that action, you would have -(log(0.9) * 10), which is roughly 1.05, affecting the network much less.
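A quick check of those numbers (math.log is the natural log, the same base torch.log uses):

import math

print(-(math.log(0.1) * 10))  # ~23.0: unlikely action that earned a good reward -> large loss
print(-(math.log(0.9) * 10))  # ~1.05: action the network already favoured -> small loss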

It's worth noting that taking the log of a softmax output can be numerically unstable (the probability can underflow to zero), so you might be better off using LogSoftmax (or F.log_softmax) as the final layer of your network and working with log-probabilities directly.
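For concreteness, here is a minimal sketch of that loss in PyTorch. The two-layer network, the optimizer, and the randomly generated states/actions/rewards are placeholders; in your case they would come from playing a batch of games:

import torch
import torch.nn.functional as F

# Hypothetical policy network: flattened 4x4 board (16 values) -> 4 move logits.
policy = torch.nn.Sequential(
    torch.nn.Linear(16, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 4),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-ins for what the game loop collects:
# states: (N, 16) board inputs, actions: (N,) indices of the moves taken,
# rewards: (N,) normalized, mean-centered final scores.
states = torch.randn(32, 16)
actions = torch.randint(0, 4, (32,))
rewards = torch.randn(32)

log_probs = F.log_softmax(policy(states), dim=1)                # stable log-probabilities
chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)   # log p(action actually taken)
loss = -(chosen * rewards).mean()                               # -log(action_probability) * reward

optimizer.zero_grad()
loss.backward()
optimizer.step()

Actions with positive (above-average) rewards have their log-probability pushed up, and actions with negative rewards are pushed down, which is exactly the encourage/discourage behaviour you're after.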
