
After some amount of training on a custom multi-agent environment using RLlib's (1.4.0) PPO network, I found that my continuous actions turn into nan (explode?), which is probably caused by a bad gradient update, which in turn depends on the loss/objective function.

As I understand it, PPO's loss function relies on three terms:

  1. The PPO Gradient objective [depends on the outputs of the old and new policies, the advantage, and the "clip" parameter (say, 0.3)]
  2. The Value Function Loss
  3. The Entropy Loss [mainly there to encourage exploration]

Total Loss = PPO Gradient objective (clipped) - vf_loss_coeff * VF Loss + entropy_coeff * entropy.
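
For concreteness, here is a minimal sketch (not RLlib's actual implementation) of how I understand these three terms combine. The tensors are made-up placeholder values, and the minimized loss is just the negative of the objective above:

import torch

# Toy batch values, purely for illustration.
advantages = torch.tensor([0.5, -1.2, 0.8])
logp_new = torch.tensor([-1.0, -0.9, -1.1])   # log prob under the current policy
logp_old = torch.tensor([-1.1, -1.0, -1.0])   # log prob recorded at sample time
vf_pred = torch.tensor([1.0, 0.2, 0.7])
vf_target = torch.tensor([1.2, 0.0, 0.9])
entropy = torch.tensor([1.4, 1.4, 1.4])

clip_param, vf_loss_coeff, entropy_coeff = 0.3, 1.0, 0.0

ratio = torch.exp(logp_new - logp_old)
surrogate = torch.min(
    advantages * ratio,
    advantages * torch.clamp(ratio, 1 - clip_param, 1 + clip_param)).mean()
vf_loss = ((vf_pred - vf_target) ** 2).mean()

# Minimize the negative of the (clipped) objective.
total_loss = -surrogate + vf_loss_coeff * vf_loss - entropy_coeff * entropy.mean()
print(total_loss)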

I have set entropy_coeff to 0, so I am focusing on the other two terms contributing to the total loss. In the progress table below, the relevant portion is where the total loss becomes inf. The only change I noticed is that the policy loss was negative on every row until iteration 445.

So my question is: can anyone explain what the policy loss is supposed to look like, and whether this is normal? How do I resolve the issue of continuous actions becoming nan after a while? Is it just a matter of lowering the learning rate?

EDIT

Here's a link to the related question (if you need more context)

END OF EDIT

I would really appreciate any tips! Thank you!

iteration    total loss    policy loss    VF loss
430 6.068537 -0.053691725999999995 6.102932
431 5.9919114 -0.046943977000000005 6.0161843
432 8.134636 -0.05247503 8.164852
433 4.222730599999999 -0.048518334 4.2523246
434 6.563492 -0.05237444 6.594456
435 8.171028999999999 -0.048245672 8.198222999999999
436 8.948264 -0.048484523 8.976327000000001
437 7.556602000000001 -0.054372005 7.5880575
438 6.124418 -0.05249534 6.155608999999999
439 4.267647 -0.052565258 4.2978816
440 4.912957700000001 -0.054498855 4.9448576
441 16.630292999999998 -0.043477765999999994 16.656229
442 6.3149705 -0.057527818 6.349851999999999
443 4.2269225 -0.05446908599999999 4.260793700000001
444 9.503102 -0.052135203 9.53277
445 inf 0.2436709 4.410831
446 nan -0.00029848056 22.596403
447 nan 0.00013323531 0.00043436907999999994
448 nan 1.5656527000000002e-05 0.0002645221
449 nan 1.3344318000000001e-05 0.0003139485
450 nan 6.941916999999999e-05 0.00025863337
451 nan 0.00015686743 0.00013607396
452 nan -5.0206604e-06 0.00027541115000000003
453 nan -4.5543664e-05 0.0004247162
454 nan 8.841756999999999e-05 0.00020278389999999998
455 nan -8.465959e-05 9.261127e-05
456 nan 3.8680790000000003e-05 0.00032097592999999995
457 nan 2.7373152999999996e-06 0.0005146417
458 nan -6.271608e-06 0.0013273798000000001
459 nan -0.00013192794 0.00030621013
460 nan 0.00038987884 0.00038019830000000004
461 nan -3.2747877999999998e-06 0.00031471922
462 nan -6.9349815e-05 0.00038836736000000006
463 nan -4.666238e-05 0.0002851575
464 nan -3.7067155e-05 0.00020161088
465 nan 3.0623291e-06 0.00019258813999999998
466 nan -8.599938e-06 0.00036465342000000005
467 nan -1.1529375e-05 0.00016500981
468 nan -3.0851965e-07 0.00022042097
469 nan -0.0001133984 0.00030230957999999997
470 nan -1.0735256e-05 0.00034000343000000003

2 Answers


It appears that the grad_clip setting in RLlib's PPO configuration is way too big (grad_clip=40). I changed it to grad_clip=4 and it worked.
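
A minimal sketch of how to set this, assuming the RLlib 1.x agents API; "MyMultiAgentEnv" is a placeholder for your own registered environment:

import ray
from ray.rllib.agents import ppo

ray.init()

config = ppo.DEFAULT_CONFIG.copy()
config["grad_clip"] = 4.0      # 40 was too large for me
config["entropy_coeff"] = 0.0  # as in the question

# "MyMultiAgentEnv" stands in for your registered custom env.
trainer = ppo.PPOTrainer(config=config, env="MyMultiAgentEnv")
for _ in range(100):
    result = trainer.train()
    print(result["episode_reward_mean"])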


I ran into the same problem when running the RLlib example, and I also posted it in this issue. I am also running PPO in a continuous and bounded action space. PPO outputs actions that are quite large, and the run eventually crashes due to a NaN-related error.

For me, it seems that when the log_std of the action's normal distribution is too large, very large actions (around 1e20) appear. I copied the loss-calculation code from RLlib's (v1.10.0) ppo_torch_policy.py and pasted it below.

logp_ratio = torch.exp(
    curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) -
    train_batch[SampleBatch.ACTION_LOGP])
action_kl = prev_action_dist.kl(curr_action_dist)
mean_kl_loss = reduce_mean_valid(action_kl)

curr_entropy = curr_action_dist.entropy()
mean_entropy = reduce_mean_valid(curr_entropy)

surrogate_loss = torch.min(
    train_batch[Postprocessing.ADVANTAGES] * logp_ratio,
    train_batch[Postprocessing.ADVANTAGES] * torch.clamp(
        logp_ratio, 1 - self.config["clip_param"],
                    1 + self.config["clip_param"]))

For such large actions, the log probability curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) computed by <class 'torch.distributions.normal.Normal'> will be -inf. Then curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) - train_batch[SampleBatch.ACTION_LOGP] returns NaN (for example, -inf minus -inf is NaN), and torch.min and torch.clamp keep the NaN output (refer to the docs).
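
Here is a minimal sketch reproducing that behaviour in isolation (just torch, float32 tensors; the 1e20 action is a stand-in for the exploded actions):

import torch
from torch.distributions import Normal

dist = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
huge_action = torch.tensor([1e20])        # stand-in for an exploded action

logp_new = dist.log_prob(huge_action)     # overflows to -inf in float32
logp_old = dist.log_prob(huge_action)     # also -inf at sample time
ratio = torch.exp(logp_new - logp_old)    # -inf - (-inf) = nan, exp(nan) = nan

print(logp_new)                           # tensor([-inf])
print(ratio)                              # tensor([nan])
print(torch.clamp(ratio, 0.7, 1.3))       # clamp keeps the nan: tensor([nan])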

So, in conclusion, I guess the NaN is caused by the -inf log probability of very large actions, which torch then fails to clip according to the "clip" parameter.

One difference is that I do not set entropy_coeff to zero. In my case, the std is encouraged to grow as large as possible, since the entropy is computed for the full normal distribution rather than the distribution restricted to the action space. I am not sure whether you get a large σ as I do. In addition, I am using PyTorch; things may be different for TF.
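
Here is a minimal sketch of that effect, assuming a Gaussian action distribution: the entropy keeps growing with σ regardless of the bounded action range, so a positive entropy_coeff keeps pushing σ up.

import torch
from torch.distributions import Normal

# Entropy of an unbounded Normal keeps growing with sigma,
# even if the sampled actions are later clipped to a bounded range.
for sigma in [0.5, 1.0, 10.0, 1e3, 1e6]:
    dist = Normal(torch.tensor(0.0), torch.tensor(sigma))
    print(sigma, dist.entropy().item())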
