I have a Ray actor that holds collected experiences (the buffer), a Ray actor that optimizes over them (the learner), and several actors that only collect experiences. This is similar to the Ape-X reinforcement learning algorithm.
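For reference, here is a stripped-down sketch of the three kinds of actors. GetSamples is the actual method I call; ReplayBuffer, AddSamples, Collector, and the sizes are simplified placeholders, not my real code:

```python
import random

import ray
import torch


@ray.remote
class ReplayBuffer:
    """Holds experiences in plain CPU memory; GetSamples returns ordinary CPU objects."""

    def __init__(self, capacity=100_000):
        self.capacity = capacity
        self.storage = []

    def AddSamples(self, samples):
        self.storage.extend(samples)
        # Keep only the most recent `capacity` experiences.
        del self.storage[:-self.capacity]

    def GetSamples(self, batch_size=256):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


@ray.remote(num_gpus=1)
class Learner:
    """Owns the model and optimizer on the GPU and runs the optimization passes."""

    def __init__(self):
        self.device = torch.device("cuda")
        # model / optimizer construction omitted


@ray.remote
class Collector:
    """Runs the environment and pushes experiences into the buffer."""

    def Run(self, buffer, steps=1000):
        for _ in range(steps):
            transition = {"obs": ..., "action": ..., "reward": ...}  # rollout omitted
            buffer.AddSamples.remote([transition])
```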
My main problem is that sampling from the buffer on the learner takes a lot of time, because the data can only be transferred from the buffer to the learner in CPU format (even when both actors are on the same machine). Consequently, after every call to ray.get(buffer.GetSamples.remote()) I still have to push the returned data to the GPU before I can run an optimization pass on the learner. This is very inefficient and takes a lot of time away from the actual optimization computation.
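Concretely, the learner's loop currently looks roughly like this (train_step and batch_size are placeholders for my real optimization step and sampling size):

```python
import ray
import torch


def learner_loop(buffer, device, train_step, batch_size=256):
    """Simplified version of what the learner does today."""
    while True:
        # Samples come back from the buffer actor as CPU objects...
        samples = ray.get(buffer.GetSamples.remote(batch_size))

        # ...so every pass pays a blocking host-to-device copy before the GPU can
        # do any useful work (collation of the samples into a tensor is omitted here).
        batch = torch.as_tensor(samples, device=device)

        train_step(batch)
```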
In an ideal world, the buffer would continuously push random samples to the GPU, and the learner would simply pick a ready chunk at each pass (a rough sketch of what I mean is below). How can I make this work?
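Roughly, the behaviour I'm imagining looks like this. This is just my own illustration with made-up names (GpuPrefetcher, next_batch), not working code I have, and I suspect it runs into exactly the threading issues I mention next:

```python
import queue
import threading

import ray
import torch


class GpuPrefetcher:
    """Illustration only: a background thread keeps a small queue of batches that are
    already resident on the GPU, so the training loop never waits on ray.get or on
    the host-to-device copy."""

    def __init__(self, buffer, device, batch_size=256, depth=4):
        self.buffer = buffer
        self.device = device
        self.batch_size = batch_size
        self.queue = queue.Queue(maxsize=depth)
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        while True:
            samples = ray.get(self.buffer.GetSamples.remote(self.batch_size))
            batch = torch.as_tensor(samples).pin_memory().to(self.device, non_blocking=True)
            self.queue.put(batch)  # blocks once `depth` GPU batches are ready

    def next_batch(self):
        return self.queue.get()  # the learner just grabs a ready chunk
```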
Also, putting both the learner and the buffer into one Ray actor doesn't work, because Ray (and, obviously, Python itself) seems to have significant problems with multi-threading, and running the two serially defeats the purpose (it ends up even slower).
Note that this is a follow-up to another question of mine here.
EDIT: I should note that this is for PyTorch.