I am working on a Reinforcement Learning problem in StableBaselines3.
I am trying to understand why this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(100000)
Does not give the exact same result as this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000)
I say they don't give the same results because in both cases, I tested out the model on a test-set through a for-loop, and the performance was different. Given that I set deterministic=True in the for-loop and I didn't change the seed, the different performance must mean the networks are different, which means the training process was different.
I was under the impression that if I run model.learn() on an existing model, it would just pick up the training where it was previously left off, but I guess that's incorrect.
Can someone help me understand why those two situations deliver different results?