I have a simulation that ticks the time every 5 seconds. I want to use OpenAI and its baselines algorithms to perform learning in this environment. For that I'd like to adapt the simulation by writing some adapter code that corresponds to the OpenAI Env API. But there is a problem: The flow of control is defined by the Agent in the OpenAI setting. But in my world, the environment steps, independent of the agent. If the agent doesn't decide or is not fast enough, the world just keeps going without him. How would one achieve this reversal of triggering the next step?
In short: OpenAI Env gets stepped by the agent. My environment gives my agent about 2-3 seconds to decide and then just tells it what's new, again offering to make choice to act or not.
As an example: My environment is rather similar to a real world stock trading market. The agent gets 24 chances to buy / sell products for a certain limit price to accumulate a certain volume for that target time and at time step 24, the reward is given to the agent and the slot is completed. The reward is based on the average price paid per item in comparison to the average price by all market participants.
At any given moment, 24 slots are traded in parallel (a 24x parallel trading of futures). I believe for this I need to create 24 environments which leads me to believe A3C would be a good choice.