After some amount of training on a custom multi-agent environment using RLlib's (1.4.0) PPO trainer, I found that my continuous actions turn into nan (explode?), which is probably caused by a bad gradient update, which in turn depends on the loss/objective function.
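For reference, this is roughly how I'm catching the bad actions (a minimal sketch; the helper name is mine, and I call it at the top of my env's `step()`):

```python
import numpy as np

def assert_finite_actions(action_dict):
    """Fail fast as soon as any agent's continuous action contains
    nan/inf, so the offending training iteration is easy to pin down."""
    for agent_id, action in action_dict.items():
        arr = np.asarray(action, dtype=np.float32)
        if not np.all(np.isfinite(arr)):
            raise ValueError(f"Non-finite action for agent {agent_id}: {arr}")
```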
As I understand it, PPO's loss function relies on three terms:
- The clipped PPO (policy gradient) objective [depends on the probability ratio between the new and old policy, the advantage, and the "clip" parameter (0.3, say)]
- The Value Function Loss
- The Entropy Loss [mainly there to encourage exploration]
Total loss = policy loss + vf_loss_coeff * VF loss - entropy_coeff * entropy, where the policy loss is the negative of the clipped PPO objective (maximizing the objective means minimizing the policy loss).
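Here is a minimal NumPy sketch of how I understand these three terms are combined (function and argument names are mine, not RLlib's; clip_param=0.3 and vf_loss_coeff=1.0 are the values I believe I'm running with):

```python
import numpy as np

def ppo_total_loss(logp_new, logp_old, advantages, value_pred, value_target,
                   entropy, clip_param=0.3, vf_loss_coeff=1.0, entropy_coeff=0.0):
    # Clipped surrogate objective (the quantity PPO maximizes).
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(
        ratio * advantages,
        np.clip(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages)
    policy_loss = -np.mean(surrogate)                     # "policy loss" column
    vf_loss = np.mean((value_pred - value_target) ** 2)   # "VF loss" column
    # Total loss as minimized: negated surrogate, plus weighted VF loss,
    # minus the entropy bonus.
    return policy_loss + vf_loss_coeff * vf_loss - entropy_coeff * np.mean(entropy)
```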
I have set the entropy coefficient to 0, so I am focusing on the other two terms contributing to the total loss. In the progress table below, the rows where the total loss becomes inf (and then nan) are the problem area. The only change I can see is that the policy loss was consistently negative until row 445, where it flips positive.
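This is roughly what the relevant part of my config looks like (a simplified sketch; the env id is a placeholder and the multi-agent policy mapping is omitted):

```python
from ray.rllib.agents.ppo import PPOTrainer, DEFAULT_CONFIG

config = DEFAULT_CONFIG.copy()
config.update({
    "entropy_coeff": 0.0,   # entropy term switched off, as described above
    "clip_param": 0.3,      # PPO clip parameter
    "vf_loss_coeff": 1.0,   # value-function loss weight (RLlib default)
    # "lr": ...,            # learning rate is one of the knobs I'm asking about
})

trainer = PPOTrainer(config=config, env="my_custom_multiagent_env")  # placeholder env id
```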
So my questions are: What is the policy loss supposed to look like, and is this behavior normal? How do I resolve the issue of continuous actions becoming nan after a while? Is it just a matter of lowering the learning rate?
EDIT
Here's a link to the related question (if you need more context)
END OF EDIT
I would really appreciate any tips! Thank you!
Iteration | Total loss | Policy loss | VF loss |
---|---|---|---|
430 | 6.068537 | -0.053691725999999995 | 6.102932 |
431 | 5.9919114 | -0.046943977000000005 | 6.0161843 |
432 | 8.134636 | -0.05247503 | 8.164852 |
433 | 4.222730599999999 | -0.048518334 | 4.2523246 |
434 | 6.563492 | -0.05237444 | 6.594456 |
435 | 8.171028999999999 | -0.048245672 | 8.198222999999999 |
436 | 8.948264 | -0.048484523 | 8.976327000000001 |
437 | 7.556602000000001 | -0.054372005 | 7.5880575 |
438 | 6.124418 | -0.05249534 | 6.155608999999999 |
439 | 4.267647 | -0.052565258 | 4.2978816 |
440 | 4.912957700000001 | -0.054498855 | 4.9448576 |
441 | 16.630292999999998 | -0.043477765999999994 | 16.656229 |
442 | 6.3149705 | -0.057527818 | 6.349851999999999 |
443 | 4.2269225 | -0.05446908599999999 | 4.260793700000001 |
444 | 9.503102 | -0.052135203 | 9.53277 |
445 | inf | 0.2436709 | 4.410831 |
446 | nan | -0.00029848056 | 22.596403 |
447 | nan | 0.00013323531 | 0.00043436907999999994 |
448 | nan | 1.5656527000000002e-05 | 0.0002645221 |
449 | nan | 1.3344318000000001e-05 | 0.0003139485 |
450 | nan | 6.941916999999999e-05 | 0.00025863337 |
451 | nan | 0.00015686743 | 0.00013607396 |
452 | nan | -5.0206604e-06 | 0.00027541115000000003 |
453 | nan | -4.5543664e-05 | 0.0004247162 |
454 | nan | 8.841756999999999e-05 | 0.00020278389999999998 |
455 | nan | -8.465959e-05 | 9.261127e-05 |
456 | nan | 3.8680790000000003e-05 | 0.00032097592999999995 |
457 | nan | 2.7373152999999996e-06 | 0.0005146417 |
458 | nan | -6.271608e-06 | 0.0013273798000000001 |
459 | nan | -0.00013192794 | 0.00030621013 |
460 | nan | 0.00038987884 | 0.00038019830000000004 |
461 | nan | -3.2747877999999998e-06 | 0.00031471922 |
462 | nan | -6.9349815e-05 | 0.00038836736000000006 |
463 | nan | -4.666238e-05 | 0.0002851575 |
464 | nan | -3.7067155e-05 | 0.00020161088 |
465 | nan | 3.0623291e-06 | 0.00019258813999999998 |
466 | nan | -8.599938e-06 | 0.00036465342000000005 |
467 | nan | -1.1529375e-05 | 0.00016500981 |
468 | nan | -3.0851965e-07 | 0.00022042097 |
469 | nan | -0.0001133984 | 0.00030230957999999997 |
470 | nan | -1.0735256e-05 | 0.00034000343000000003 |