So, I created a custom environment based on gymnasium and I want to train an agent on it with PPO from stable_baselines3. I'm using version 2.0.0a5 of the latter in order to use gymnasium. I have the following code:
from stable_baselines3 import PPO

env = MyEnv()  # my custom gymnasium environment
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1, progress_bar=True)
This code does not stop: the progress bar goes past the total number of time steps and just keeps going. I may be doing something wrong with the environment, but I am not sure what, or why it would make the learning process run for more iterations than the total_timesteps fixed by the user.
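One detail that may explain part of the overshoot, stated here as my understanding of stable_baselines3's on-policy training loop rather than something checked against 2.0.0a5: PPO seems to collect a full rollout of n_steps steps per environment (2048 by default) before it compares the step counter with total_timesteps. Under that assumption, the number of steps actually run can be sketched in plain Python. The helper `steps_actually_run` is hypothetical and not part of the library.

```python
import math


def steps_actually_run(total_timesteps, n_steps=2048, n_envs=1):
    """Estimate how many environment steps an on-policy learner runs.

    Assumes the loop is: while steps_done < total_timesteps, collect one
    full rollout of n_steps * n_envs steps. This is a sketch of that
    arithmetic, not a call into stable_baselines3.
    """
    rollout_size = n_steps * n_envs
    # Rollouts are never cut short, so the budget is rounded up
    # to a whole number of rollouts.
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return n_rollouts * rollout_size


if __name__ == "__main__":
    print(steps_actually_run(1))
    print(steps_actually_run(2049))
```

If this is right, total_timesteps=1 alone would account for the bar going past 1, but not for a run that never stops.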
So, what could be wrong with the environment? What should I check for that could make the learning process run indefinitely?
Edit: the plot thickens. I tried the same thing with an SAC agent and it does not go into an infinite loop during learning. But it does go into one during evaluation!
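One check that might narrow this down is whether an episode in the environment ever ends, since an evaluation loop that waits for episodes to finish would hang if step() never returns terminated or truncated as True. Below is a minimal sketch of that check. The two toy classes are hypothetical stand-ins for MyEnv that only mimic the gymnasium reset/step signatures (they do not import gymnasium), and `episode_length` is an illustrative helper, not a library function.

```python
class NeverEndingEnv:
    """Hypothetical stand-in for MyEnv: an episode never ends."""

    def reset(self, seed=None, options=None):
        return 0.0, {}  # observation, info

    def step(self, action):
        # observation, reward, terminated, truncated, info
        return 0.0, 0.0, False, False, {}


class FiniteEnv(NeverEndingEnv):
    """Same toy env, but it truncates the episode after max_steps steps."""

    def __init__(self, max_steps=10):
        self.max_steps = max_steps
        self.t = 0

    def reset(self, seed=None, options=None):
        self.t = 0
        return 0.0, {}

    def step(self, action):
        self.t += 1
        truncated = self.t >= self.max_steps
        return 0.0, 0.0, False, truncated, {}


def episode_length(env, step_cap=1000, action=0):
    """Return the step at which the episode ends, or None if the cap is hit."""
    env.reset()
    for t in range(1, step_cap + 1):
        _, _, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            return t
    return None  # the episode did not end within step_cap steps


if __name__ == "__main__":
    print(episode_length(NeverEndingEnv()))
    print(episode_length(FiniteEnv(max_steps=10)))
```

If the real environment behaves like NeverEndingEnv here, adding a termination condition or a step limit would be the first thing to try.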