Taking a look at the saving_experiences file you shared, it looks like when you load these offline experiences into RLlib, it creates SampleBatch objects, which is how data is presented to agents (at least for off-policy agents) when they are trained with gradient methods.
The "weights" refer to the priority weights of the samples, and would be used for Importance Sampling-weighted Prioritized Experience Replay (check out the PrioritzedReplayBuffer
class here if interested). You should be able to just set these all to 1.0
if you don't care too much about weighting them.
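For instance, if you've already got a batch loaded, something like the sketch below should do it (batch is just a placeholder for whatever SampleBatch you're holding):
import numpy as np

# Replace the priority weights with uniform 1.0s
# ("batch" is a placeholder for a SampleBatch you've already built or loaded)
batch["weights"] = np.ones(len(batch["obs"]), dtype=np.float32)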
You should be able to use offline experiences with just (s, a, r, s') with SAC; however, you may need to format your data into an appropriate SampleBatch, e.g. with the code fragment below:
import numpy as np
from ray.rllib.policy.sample_batch import SampleBatch, MultiAgentBatch

# Build the batch dictionary from your (s, a, r, s') data,
# where s, a, r, s_prime are arrays of the same length
rllib_batch_dict = {
    "obs": s,
    "actions": a,
    "rewards": r,
    "new_obs": s_prime,
    "dones": np.array([False for _ in range(len(s))]),  # True where an episode ends
    "weights": np.ones(len(s)),
    "eps_id": episode_ids,
    "unroll_id": episode_steps,
    "agent_index": np.zeros(len(s)),
}
# Wrap your dictionary in a SampleBatch wrapper
rllib_batch = SampleBatch(rllib_batch_dict)
# If you still get errors, try wrapping this in a MultiAgentBatch
# (keyed by policy ID, plus the total number of env steps)
marl_batch = MultiAgentBatch({"0": rllib_batch}, len(s))
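If you then want SAC to read these experiences through the "input" config, one option is to write the batch out in RLlib's offline JSON format with JsonWriter (the same writer used in saving_experiences). This is just a sketch; the output path is only an example:
from ray.rllib.offline.json_writer import JsonWriter

# Write the batch to disk in RLlib's offline JSON format
# ("/tmp/my-offline-data" is just an example path)
writer = JsonWriter("/tmp/my-offline-data")
writer.write(rllib_batch)
# Then point SAC at it, e.g. config = {"input": "/tmp/my-offline-data", ...}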
For "t", I think this is just the time step for the sample in the episode, e.g. 0 if it is the first step, 1 if it is the second, and so on. Perhaps you could just keep track of this as you create your (s, a, r, s
)` data?
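If it helps, here's a minimal sketch of building that "t" column, assuming you know how long each episode is (episode_lengths is just a placeholder for however you track that):
import numpy as np

# Per-episode timestep counter: 0, 1, 2, ... restarting at each new episode
# (episode_lengths is assumed to hold the length of each episode in your data)
t = np.concatenate([np.arange(length) for length in episode_lengths])
rllib_batch_dict["t"] = t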
Finally, not sure if you need this, but you can also try creating a SampleBatch manually, and then wrapping this in a MultiAgentBatch if need be. To do that, you'll just need to follow the code above, and then add in the following key/value pairs before creating the SampleBatch object (which is really just a dictionary wrapper); there's also a SampleBatchBuilder sketch after the list:
'obs' --> s
'actions' --> a
'rewards' --> r
'new_obs' --> s'
'dones' --> set these to whether the timestep marks the end of an episode
'agent_index' --> 0 (if single-agent, else you'll need to index each agent)
'eps_id' --> This is a placeholder episode ID (I don't think it's ever used during training); could set to 0, 1, etc.
'unroll_id' --> This is the step number, e.g. 0, 1, etc.
'weights' --> These are the weights for importance sampling. If you don't care about this you could just set them to 1.0.
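Alternatively, RLlib's SampleBatchBuilder (the same helper used in saving_experiences) can assemble these keys for you one timestep at a time. Here's a rough sketch, assuming you loop over your stored transitions; transitions, s, a, r, s_prime, and done are just placeholders for your data:
from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder

builder = SampleBatchBuilder()
# "transitions" is assumed to be your list of (s, a, r, s', done) tuples for one episode
for t, (s, a, r, s_prime, done) in enumerate(transitions):
    builder.add_values(
        t=t,
        eps_id=0,        # placeholder episode ID
        agent_index=0,   # single-agent
        obs=s,
        actions=a,
        rewards=r,
        new_obs=s_prime,
        dones=done,
        weights=1.0,     # uniform importance-sampling weight
    )
sample_batch = builder.build_and_reset()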
Hope this helps - good luck!