I have been trying to understand a blog on soft actor critic where we have a neural network representing a policy that outputs mean and std of gaussian distribution of action for a given state. Since direct back-propagation through stochastic node is not possible , reparamterization trick is applied as follows:
`normal = Normal(0, 1)
z = normal.sample()
action = torch.tanh(mean+ std*z.to(device))
log_prob = Normal(mean, std).log_prob(mean+ std*z.to(device)) - torch.log(1 - action.pow(2) + epsilon)
return action, log_prob, z, mean, log_std`
I want to know how the log_prob term was derived. Any help would be highly appreciated.