
I have some offline experiences (s, a, r, s') that were generated with a heuristic, and I want to use them when training SAC agents. Using the saving_experiences example to prepare my data gives me an error when the result is used with SAC. Here is a colab where the issue is reproduced for the pendulum-v0 environment. From the error message, I understand that SAC expects some 'weights' (and a time 't'?!) alongside the experiences that were generated as offline data. Can I use just the offline experiences (s, a, r, s') with SAC?

Thanks.

1 Answer


Taking a look at the saving_experiences file you shared, it looks like loading these offline experiences into RLlib creates SampleBatch objects, which is how data is presented to agents (at least to off-policy agents) when they are trained with gradient methods.
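If it helps to inspect those objects, here is a minimal sketch that reads offline data back as SampleBatches, assuming your data was written with a JsonWriter as in the saving_experiences example (the "/tmp/demo-out" path is just a placeholder for wherever your files live):

import os
from ray.rllib.offline.json_reader import JsonReader

# Placeholder path: point this at the directory saving_experiences wrote to
reader = JsonReader(os.path.join("/tmp", "demo-out"))

# Each call to next() returns one SampleBatch reconstructed from the offline file(s)
batch = reader.next()
print(batch.keys())  # e.g. 'obs', 'actions', 'rewards', 'new_obs', 'dones', ...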

The "weights" refer to the priority weights of the samples, and would be used for Importance Sampling-weighted Prioritized Experience Replay (check out the PrioritzedReplayBuffer class here if interested). You should be able to just set these all to 1.0 if you don't care too much about weighting them.

You should be able to use offline experiences with just (s, a, r, s') with SAC; however, you may need to format your data into an appropriate SampleBatch, e.g. with the code fragment below:

import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch, MultiAgentBatch

# Build a dictionary from your (s, a, r, s') data.
# s, a, r are arrays of observations, actions, and rewards; s_prime holds the
# next observations (your s'); episode_ids and episode_steps are arrays you
# track yourself while generating the data.
rllib_batch_dict = {"obs": s, "actions": a, "rewards": r, "new_obs": s_prime,
                    "dones": np.array([False for _ in range(len(s))]),  # set True where an episode ends
                    "weights": np.ones(len(s)), "eps_id": episode_ids,
                    "unroll_id": episode_steps, "agent_index": np.zeros(len(s))}

# Wrap your dictionary in a SampleBatch wrapper
rllib_batch = SampleBatch(rllib_batch_dict)

# If you still get errors, try wrapping this in a MultiAgentBatch
# (a dict mapping policy IDs to SampleBatches, plus the total env step count)
marl_batch_dict = {"0": rllib_batch}
marl_batch = MultiAgentBatch(marl_batch_dict, len(s))

For "t", I think this is just the time step for the sample in the episode, e.g. 0 if it is the first step, 1 if it is the second, and so on. Perhaps you could just keep track of this as you create your (s, a, r, s )` data?

Finally, not sure if you need this, but you can also try creating a SampleBatch manually, and then wrapping it in a MultiAgentBatch if need be. To do that, you'll just need to follow the code above and add in the following key/value pairs before creating the SampleBatch object (which is really just a dictionary wrapper):

  1. 'obs' --> s
  2. 'actions' --> a
  3. 'rewards' --> r
  4. 'new_obs' --> s'
  5. 'dones' --> set these to whether the timestep marks the end of an episode
  6. 'agent_index' --> 0 (if single-agent, else you'll need to index)
  7. 'eps_id' --> This is a placeholder episode ID (I don't think it's ever used during training); you could set it to 0, 1, etc. (see the sketch after this list).
  8. 'unroll_id' --> This is the step number, e.g. 0, 1, etc.
  9. 'weights' --> These are the weights for importance sampling. If you don't care about this you could just set them to 1.0.
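For items 7 and 8, here is a small sketch (again assuming you have the done flags in order) for deriving per-sample episode IDs by starting a new ID after every terminal transition:

import numpy as np

dones = np.array([False, False, True, False, True])  # example: two episodes
eps_id = np.zeros(len(dones), dtype=np.int64)
eps_id[1:] = np.cumsum(dones[:-1])  # bump the ID right after each done
print(eps_id)  # -> [0 0 0 1 1]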

Hope this helps - good luck!

Ryan Sander