Evaluate_policy records much higher mean reward then stable baselines 3 logger

Question

As the title says, I am testen PPO with the Cartpole Environment using SB3, but if I look at the performance measured be the evaluate_policy function I reach a mean reward of 475 reliable at 20000 timesteps, but I need about 90000 timesteps if I look at console log to get comparable results during learning.

Why does my model perform so much better using the evaluation helper?

I used the same hyperparameters in both cases, and I used a new environment for the evaluation with the helper method.

score 0 · Answer 1 · answered Jan 21 '23 at 23:58

0

I think I have solved the "problem": evaluate_policy uses deterministic action in it's default settings, which leads to better results faster.

answered Jan 21 '23 at 23:58

Martin Brandys

1

Evaluate_policy records much higher mean reward then stable baselines 3 logger

1 Answers1