When running the TensorFlow Agents notebook for the Soft Actor-Critic Minitaur, https://www.tensorflow.org/agents/tutorials/7_SAC_minitaur_tutorial, the following directories are created under /tmp:
/tmp
    eval
    train
    policies
        checkpoints
        collect_policy
        greedy_policy
        policy
I initially assumed that 'collect_policy' is the policy from which the agent learns (since SAC is off-policy), that 'greedy_policy' is the optimal policy, continually updated as training progresses, and that 'checkpoints' are stored in case you want to resume training at a later stage. What 'policy' is, I don't know.
However, I see that 'collect_policy', 'greedy_policy' and 'policy' are sometimes only modified when training starts, specifically when the checkpointing triggers are created:
# Triggers to save the agent's policy checkpoints.
learning_triggers = [
    triggers.PolicySavedModelTrigger(
        saved_model_dir,
        tf_agent,
        train_step,
        interval=policy_save_interval),
    triggers.StepPerSecondLogTrigger(train_step, interval=1000),
]
At other times they are updated continuously, and the checkpoints are always updated continuously. I am therefore unsure which policy should be used after training (for inference, so to speak), since the checkpoints only store model variables, which, as far as I understand, need to be loaded in conjunction with a policy.
To summarize: after training, which policy (or policy + checkpoint) do I use to get the best results, and how do I load it?
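For reference, my current best guess at loading one of these policies for inference is sketched below. Whether 'greedy_policy' is the right directory to point at is exactly what I'm unsure about, and eval_env is just a stand-in for an evaluation environment:

import os
import tensorflow as tf

# My guess: load the exported greedy policy SavedModel for inference.
policy_dir = os.path.join("/tmp", "policies", "greedy_policy")
saved_policy = tf.saved_model.load(policy_dir)

# Then run it on time steps from an evaluation environment, e.g.:
# time_step = eval_env.reset()
# action_step = saved_policy.action(time_step)

If instead one of the checkpoints is supposed to be restored on top of one of these saved policies, I haven't figured out how that is done either.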
Thanks!