
I have been trying to understand a blog post on Soft Actor-Critic in which a neural network representing the policy outputs the mean and standard deviation of a Gaussian distribution over actions for a given state. Since direct back-propagation through a stochastic node is not possible, the reparameterization trick is applied as follows:

    normal   = Normal(0, 1)
    z        = normal.sample()
    action   = torch.tanh(mean + std * z.to(device))
    log_prob = Normal(mean, std).log_prob(mean + std * z.to(device)) - torch.log(1 - action.pow(2) + epsilon)
    return action, log_prob, z, mean, log_std

I want to know how the log_prob term was derived. Any help would be highly appreciated.

  • I think I might be too late, but you can read Appendix C of the original [Soft Actor-Critic paper](https://arxiv.org/abs/1801.01290). Basically, this `torch.log(1 - action.pow(2) + epsilon)` is from squashing the `action` between -1 and 1 using the `tanh` function. – esh3390 Jun 04 '23 at 10:21
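For what it's worth, the correction term in that comment follows from the change-of-variables formula: with `u = mean + std * z` and `action = tanh(u)`, the density of the squashed action is the Gaussian density of `u` divided by the Jacobian `|d tanh(u)/du| = 1 - tanh(u)^2`, so `log_prob = log N(u; mean, std) - log(1 - action^2)` (summed over action dimensions; the `epsilon` only guards against `log(0)`). A minimal numeric sketch of that Jacobian step, using only the standard library (the helper names here are made up for illustration):

```python
import math

def tanh_jacobian_log(u):
    # d/du tanh(u) = 1 - tanh(u)^2, so the log |Jacobian| of the
    # squashing a = tanh(u) is log(1 - tanh(u)^2) = log(1 - a^2).
    return math.log(1.0 - math.tanh(u) ** 2)

def numeric_jacobian_log(u, h=1e-6):
    # Central finite difference of tanh, then take the log,
    # as an independent check of the analytic Jacobian above.
    return math.log((math.tanh(u + h) - math.tanh(u - h)) / (2.0 * h))

def gaussian_log_pdf(x, mean, std):
    # log N(x; mean, std) for a scalar x
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))

def squashed_log_prob(u, mean, std):
    # log-density of the squashed action a = tanh(u):
    # log pi(a) = log N(u; mean, std) - log(1 - tanh(u)^2)
    return gaussian_log_pdf(u, mean, std) - tanh_jacobian_log(u)

print(abs(tanh_jacobian_log(0.7) - numeric_jacobian_log(0.7)))  # tiny (finite-difference error)
```

This is exactly what the snippet computes element-wise, except that PyTorch evaluates `log(1 - action**2 + epsilon)` on tensors and the per-dimension terms are typically summed to get the log-probability of the full action vector.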
