
In Ray RLlib, I usually run a PPO training with ray.tune.run like this:

import ray
from ray import tune

ray.init(log_to_driver=False, num_cpus=3,
         local_mode=args.local_mode, num_gpus=1)

config = {
    "env_config": {"code": "codeA"},
    "parm": "paramA",
}
stop = {
    "training_iteration": args.stop_iters,
    "timesteps_total": args.stop_timesteps,
    "episode_reward_mean": args.stop_reward,
}
results = tune.run("PPO", config=config, verbose=0,
                   stop=stop, checkpoint_at_end=True,
                   metric="episode_reward_mean", mode="max",
                   checkpoint_freq=1)

checkpoints = results.get_trial_checkpoints_paths(
    trial=results.get_best_trial(metric="episode_reward_mean", mode="max"),
    metric="episode_reward_mean")
checkpoint_path = checkpoints[0][0]  # path of the best checkpoint
metric = checkpoints[0][1]           # its metric value

In the next round, I usually retrain the model by restoring the checkpoint like this:

results = tune.run("PPO", config=config, verbose=0,
                   stop=stop, checkpoint_at_end=True,
                   metric="episode_reward_mean", mode="max",
                   checkpoint_freq=1, restore=checkpoint_path)

For inference:

from ray.rllib.agents import ppo

agent = ppo.PPOTrainer(config=config, env=env)
agent.restore(checkpoint_path=checkpoint_path)

This flow works. My questions are: (1) Can I save the whole PyTorch model at the end of ray.tune.run? (2) Can I load that PyTorch model into the next round of ray.tune.run training, instead of restoring a checkpoint? (3) At the inference stage, how can I load the trained whole PyTorch model into the PPO agent? In the restore-agent inference flow, I cannot load more than 10 models into memory at a time; loading them all causes an OOM error. If I restore the models one by one, the checkpoint-restoring process is too time-consuming and cannot meet the timeliness requirements. Can anyone help me?

Dr. GUO

1 Answer


You can look into the keep_checkpoints_num and checkpoint_score_attr arguments of tune.run() to customize how many checkpoints you keep. The default for keep_checkpoints_num is None, so all checkpoints are stored; under storage constraints, you can keep only the top ones, ranked by the checkpoint score attribute.
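A minimal sketch of such a call, reusing the config, stop, and metric placeholders from the question (keep_checkpoints_num=5 is an arbitrary example value):

results = tune.run(
    "PPO",
    config=config,
    stop=stop,
    verbose=0,
    checkpoint_freq=1,
    checkpoint_at_end=True,
    keep_checkpoints_num=5,                       # keep only the 5 best checkpoints on disk
    checkpoint_score_attr="episode_reward_mean",  # rank checkpoints by this metric
    metric="episode_reward_mean",
    mode="max",
)

Note that this bounds how many checkpoints accumulate on disk during training; it does not by itself reduce how many restored models fit in memory at inference time.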

A_the_kunal
    As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 06 '22 at 00:12