When running the TensorFlow Agents notebook for the Soft Actor-Critic Minitaur, https://www.tensorflow.org/agents/tutorials/7_SAC_minitaur_tutorial, the following directories are created under /tmp:
/tmp
    eval
    train
    policies
        checkpoints
        collect_policy
        greedy_policy
        policy
I initially assumed that 'collect_policy' is the policy from which the agent learns (since SAC is off-policy), that 'greedy_policy' is the optimal policy, continually updated as training progresses, and that 'checkpoints' are stored in case you want to resume training at a later stage. What 'policy' is, I don't know.
However, I see that 'collect_policy', 'greedy_policy' and 'policy' are sometimes only modified when training starts, specifically when the checkpointing triggers are created:
# Triggers to save the agent's policy checkpoints.
learning_triggers = [
    triggers.PolicySavedModelTrigger(
        saved_model_dir,
        tf_agent,
        train_step,
        interval=policy_save_interval),
    triggers.StepPerSecondLogTrigger(train_step, interval=1000),
]
At other times they are updated continuously, and the checkpoints are always updated continuously. I am therefore unsure which policy should be used after training (for inference, so to speak), since the checkpoints only store model variables, which, as far as I understand, need to be loaded in conjunction with a policy.
To summarize: after training, which policy (or policy + checkpoint) do I use to get the best results, and how do I load it?
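For reference, my current best guess at loading one of these policies for inference is sketched below. Whether 'greedy_policy' is the right directory to point at is exactly what I'm unsure about, and eval_env is just a stand-in for an evaluation environment:

import os
import tensorflow as tf

# My guess: load the exported greedy policy SavedModel for inference.
policy_dir = os.path.join("/tmp", "policies", "greedy_policy")
saved_policy = tf.saved_model.load(policy_dir)

# Then run it on time steps from an evaluation environment, e.g.:
# time_step = eval_env.reset()
# action_step = saved_policy.action(time_step)

If instead one of the checkpoints is supposed to be restored on top of one of these saved policies, I haven't figured out how that is done either.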
Thanks!