I'm attempting to first train a PPOTrainer for 250 iterations on a simple environment, and then finish training it on a modified environment. (The only difference between the environments would be a change in one of the environment configuration parameters).

So far I have tried implementing the following:

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["env_config"] = defaultconfig
trainer = ppo.PPOTrainer(config=config, env=qsd.QSDEnv)
trainer.config['env_config']['meas_quant']=1
for i in range(250):
    result = trainer.train()

#attempt to change the parameter 'meas_quant' from 1 to 2
trainer.config['env_config']['meas_quant'] = 2
trainer.workers.local_worker().env.meas_quant = 2

for i in range(250):
    result = trainer.train()

However, the second training still uses the initial environment configuration. Any help in figuring out how to fix this would be greatly appreciated!

sbrand

1 Answer

I'd suggest one of two approaches:

Create a new Trainer instance and restore from the first

import ray
from ray.rllib.agents import ppo
# qsd is the asker's own module that provides QSDEnv

ray.init()
env_config["meas_quant"] = 1    # Assuming env_config is set
config = {"env_config": env_config}  
trainer = ppo.PPOTrainer(config=config, env=qsd.QSDEnv)
for i in range(250):
    result = trainer.train()
checkpoint = trainer.save_to_object()

env_config['meas_quant'] = 2
config["env_config"] = env_config
trainer2 = ppo.PPOTrainer(config=config, env=qsd.QSDEnv)
trainer2.restore_from_object(checkpoint)
# Continue training with trainer2 on the modified environment ...

Alter the environment directly for each worker

May require modifying the environment to set the parameter you're looking to change.

# After the first training loop
# A lambda body cannot contain an assignment statement, so use setattr
trainer.workers.foreach_worker(
    lambda w: w.foreach_env(lambda e: setattr(e, "meas_quant", 2))
)
# Do your stuff ...
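
If the environment derives other state from the parameter, overwriting the attribute alone may leave that state stale. A minimal sketch of one way around this, written without RLlib so it runs on its own: give the environment an explicit setter. `ToyEnv`, `set_meas_quant` and `n_outcomes` are hypothetical names used only for illustration; they stand in for `QSDEnv` and whatever it computes from `meas_quant`.

```python
class ToyEnv:
    """Hypothetical stand-in for QSDEnv; not part of RLlib or the question."""

    def __init__(self, env_config):
        self.meas_quant = env_config["meas_quant"]
        # Made-up derived state that a bare attribute write would not update.
        self.n_outcomes = 2 ** self.meas_quant

    def set_meas_quant(self, value):
        # Update the parameter and everything computed from it together.
        self.meas_quant = value
        self.n_outcomes = 2 ** value


# Plain-Python equivalent of visiting every environment on every worker.
envs = [ToyEnv({"meas_quant": 1}) for _ in range(3)]
for e in envs:
    e.set_meas_quant(2)
```

With a setter like this, the inner call in the snippet above would become `w.foreach_env(lambda e: e.set_meas_quant(2))`, assuming your environment defines such a method.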

As an aside, I would avoid using DEFAULT_CONFIG.copy(), since it only creates a shallow copy of the dictionary: nested configuration dicts are still shared with the original, so changing them alters the default configuration itself. Besides, RLlib's Trainer already deep-merges whatever config dict you pass to it with the default configuration.
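
The shallow-copy pitfall can be reproduced with an ordinary nested dict. This is only a sketch: `DEFAULTS` is a made-up stand-in, not RLlib's actual default configuration.

```python
import copy

# Stand-in for a nested default config (hypothetical values).
DEFAULTS = {"lr": 5e-5, "env_config": {"meas_quant": 1}}

# dict.copy() duplicates only the top level; the nested dict is shared.
shallow = DEFAULTS.copy()
shallow["env_config"]["meas_quant"] = 2
leaked = DEFAULTS["env_config"]["meas_quant"]    # the original changed too

# Reset, then repeat with a deep copy, which duplicates nested dicts as well.
DEFAULTS["env_config"]["meas_quant"] = 1
deep = copy.deepcopy(DEFAULTS)
deep["env_config"]["meas_quant"] = 2
isolated = DEFAULTS["env_config"]["meas_quant"]  # the original is untouched
```

If you do want to start from the defaults, `copy.deepcopy` avoids the leak; passing only the keys you want to override, as in the first approach, sidesteps the issue entirely.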

aaglovatto