
After some amount of training on a custom multi-agent environment using RLlib's (1.4.0) PPO network, I found that my continuous actions turn into nan (explode?), which is probably caused by a bad gradient update, which in turn depends on the loss/objective function.

As I understand it, PPO's loss function relies on three terms:

  1. The PPO Gradient objective [depends on the outputs of the old and new policies, the advantage, and the "clip" parameter (say, 0.3)]
  2. The Value Function Loss
  3. The Entropy Loss [mainly there to encourage exploration]

Total Loss = PPO Gradient objective (clipped) - vf_loss_coeff * VF Loss + entropy_coeff * entropy.
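
For concreteness, here is a minimal sketch (not RLlib's actual implementation) of how I understand these three terms combine. The tensors are made-up placeholder values, and the minimized loss is just the negative of the objective above:

import torch

# Toy batch values, purely for illustration.
advantages = torch.tensor([0.5, -1.2, 0.8])
logp_new = torch.tensor([-1.0, -0.9, -1.1])   # log prob under the current policy
logp_old = torch.tensor([-1.1, -1.0, -1.0])   # log prob recorded at sample time
vf_pred = torch.tensor([1.0, 0.2, 0.7])
vf_target = torch.tensor([1.2, 0.0, 0.9])
entropy = torch.tensor([1.4, 1.4, 1.4])

clip_param, vf_loss_coeff, entropy_coeff = 0.3, 1.0, 0.0

ratio = torch.exp(logp_new - logp_old)
surrogate = torch.min(
    advantages * ratio,
    advantages * torch.clamp(ratio, 1 - clip_param, 1 + clip_param)).mean()
vf_loss = ((vf_pred - vf_target) ** 2).mean()

# Minimize the negative of the (clipped) objective.
total_loss = -surrogate + vf_loss_coeff * vf_loss - entropy_coeff * entropy.mean()
print(total_loss)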

I have set entropy_coeff to 0, so I am focusing on the other two terms contributing to the total loss. In the progress table below, the relevant portion is where the total loss becomes inf. The only change I noticed is that the policy loss was negative on every row until iteration 445.

So my question is: can anyone explain what the policy loss is supposed to look like, and whether this is normal? How do I resolve the issue of continuous actions becoming nan after a while? Is it just a matter of lowering the learning rate?

EDIT

Here's a link to the related question (if you need more context)

END OF EDIT

I would really appreciate any tips! Thank you!

iteration    total loss    policy loss    VF loss
430 6.068537 -0.053691725999999995 6.102932
431 5.9919114 -0.046943977000000005 6.0161843
432 8.134636 -0.05247503 8.164852
433 4.222730599999999 -0.048518334 4.2523246
434 6.563492 -0.05237444 6.594456
435 8.171028999999999 -0.048245672 8.198222999999999
436 8.948264 -0.048484523 8.976327000000001
437 7.556602000000001 -0.054372005 7.5880575
438 6.124418 -0.05249534 6.155608999999999
439 4.267647 -0.052565258 4.2978816
440 4.912957700000001 -0.054498855 4.9448576
441 16.630292999999998 -0.043477765999999994 16.656229
442 6.3149705 -0.057527818 6.349851999999999
443 4.2269225 -0.05446908599999999 4.260793700000001
444 9.503102 -0.052135203 9.53277
445 inf 0.2436709 4.410831
446 nan -0.00029848056 22.596403
447 nan 0.00013323531 0.00043436907999999994
448 nan 1.5656527000000002e-05 0.0002645221
449 nan 1.3344318000000001e-05 0.0003139485
450 nan 6.941916999999999e-05 0.00025863337
451 nan 0.00015686743 0.00013607396
452 nan -5.0206604e-06 0.00027541115000000003
453 nan -4.5543664e-05 0.0004247162
454 nan 8.841756999999999e-05 0.00020278389999999998
455 nan -8.465959e-05 9.261127e-05
456 nan 3.8680790000000003e-05 0.00032097592999999995
457 nan 2.7373152999999996e-06 0.0005146417
458 nan -6.271608e-06 0.0013273798000000001
459 nan -0.00013192794 0.00030621013
460 nan 0.00038987884 0.00038019830000000004
461 nan -3.2747877999999998e-06 0.00031471922
462 nan -6.9349815e-05 0.00038836736000000006
463 nan -4.666238e-05 0.0002851575
464 nan -3.7067155e-05 0.00020161088
465 nan 3.0623291e-06 0.00019258813999999998
466 nan -8.599938e-06 0.00036465342000000005
467 nan -1.1529375e-05 0.00016500981
468 nan -3.0851965e-07 0.00022042097
469 nan -0.0001133984 0.00030230957999999997
470 nan -1.0735256e-05 0.00034000343000000003

2 Answers


It appears that the grad_clip setting in RLlib's PPO configuration is way too big (grad_clip=40). I changed it to grad_clip=4 and it worked.
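
A minimal sketch of how to set this, assuming the RLlib 1.x agents API; "MyMultiAgentEnv" is a placeholder for your own registered environment:

import ray
from ray.rllib.agents import ppo

ray.init()

config = ppo.DEFAULT_CONFIG.copy()
config["grad_clip"] = 4.0      # 40 was too large for me
config["entropy_coeff"] = 0.0  # as in the question

# "MyMultiAgentEnv" stands in for your registered custom env.
trainer = ppo.PPOTrainer(config=config, env="MyMultiAgentEnv")
for _ in range(100):
    result = trainer.train()
    print(result["episode_reward_mean"])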


I ran into the same problem when running the RLlib example, and I also posted it in this issue. I am also running PPO in a continuous and bounded action space. PPO outputs actions that are quite large, and the run eventually crashes due to a NaN-related error.

For me, it seems that when the log_std of the action's normal distribution is too large, very large actions (around 1e20) appear. I copied the loss-calculation code from RLlib's (v1.10.0) ppo_torch_policy.py and pasted it below.

logp_ratio = torch.exp(
    curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) -
    train_batch[SampleBatch.ACTION_LOGP])
action_kl = prev_action_dist.kl(curr_action_dist)
mean_kl_loss = reduce_mean_valid(action_kl)

curr_entropy = curr_action_dist.entropy()
mean_entropy = reduce_mean_valid(curr_entropy)

surrogate_loss = torch.min(
    train_batch[Postprocessing.ADVANTAGES] * logp_ratio,
    train_batch[Postprocessing.ADVANTAGES] * torch.clamp(
        logp_ratio, 1 - self.config["clip_param"],
                    1 + self.config["clip_param"]))

For such large actions, the log probability curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) computed by <class 'torch.distributions.normal.Normal'> will be -inf. Then curr_action_dist.logp(train_batch[SampleBatch.ACTIONS]) - train_batch[SampleBatch.ACTION_LOGP] returns NaN (for example, -inf minus -inf is NaN), and torch.min and torch.clamp keep the NaN output (refer to the docs).
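
Here is a minimal sketch reproducing that behaviour in isolation (just torch, float32 tensors; the 1e20 action is a stand-in for the exploded actions):

import torch
from torch.distributions import Normal

dist = Normal(torch.tensor([0.0]), torch.tensor([1.0]))
huge_action = torch.tensor([1e20])        # stand-in for an exploded action

logp_new = dist.log_prob(huge_action)     # overflows to -inf in float32
logp_old = dist.log_prob(huge_action)     # also -inf at sample time
ratio = torch.exp(logp_new - logp_old)    # -inf - (-inf) = nan, exp(nan) = nan

print(logp_new)                           # tensor([-inf])
print(ratio)                              # tensor([nan])
print(torch.clamp(ratio, 0.7, 1.3))       # clamp keeps the nan: tensor([nan])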

So, in conclusion, I guess the NaN is caused by the -inf log probability of very large actions, which torch then fails to clip according to the "clip" parameter.

One difference is that I do not set entropy_coeff to zero. In my case, the std is encouraged to grow as large as possible, since the entropy is computed for the full normal distribution rather than the distribution restricted to the action space. I am not sure whether you get a large σ as I do. In addition, I am using PyTorch; things may be different for TF.
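
Here is a minimal sketch of that effect, assuming a Gaussian action distribution: the entropy keeps growing with σ regardless of the bounded action range, so a positive entropy_coeff keeps pushing σ up.

import torch
from torch.distributions import Normal

# Entropy of an unbounded Normal keeps growing with sigma,
# even if the sampled actions are later clipped to a bounded range.
for sigma in [0.5, 1.0, 10.0, 1e3, 1e6]:
    dist = Normal(torch.tensor(0.0), torch.tensor(sigma))
    print(sigma, dist.entropy().item())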
