This question relates to a previous question about imitation learning:
"train stable baselines 3 with examples?"
I want to create a very simple venv and/or policy to simulate the expert behavior. Specifically, I want to create an expert that corresponds directly to my recorded observations and actions.
The source code of generate_trajectories says the policy can be: "2) A Callable that takes an ndarray of observations and returns an ndarray of corresponding actions."
The function additionally needs an environment (venv).
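To make that concrete, here is a minimal sketch of what I imagine such a callable could look like, assuming the human recordings are already stored as numpy arrays (the file names human_obs.npy / human_acts.npy are made up) and the expert simply returns the action of the nearest recorded observation:

```python
import numpy as np

# Hypothetical recorded demonstrations, one row per timestep:
# human_obs.npy  -> shape (N, 3): left, front, right distance readings
# human_acts.npy -> shape (N,):   0 = left, 1 = forward, 2 = right
recorded_obs = np.load("human_obs.npy")
recorded_acts = np.load("human_acts.npy")

def expert_policy(obs: np.ndarray) -> np.ndarray:
    """Return one action per observation by nearest-neighbour lookup
    in the recorded human data (a stand-in for a real expert)."""
    # obs has shape (n_envs, 3) because generate_trajectories runs on a VecEnv.
    dists = np.linalg.norm(recorded_obs[None, :, :] - obs[:, None, :], axis=-1)
    nearest = dists.argmin(axis=1)   # index of the closest recorded obs per env
    return recorded_acts[nearest]    # shape (n_envs,)
```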
So it seems the training will sample actions and get observations and rewards from the callable and/or the venv.
Now, what I am trying to achieve is to train on fixed tuples of observations, actions, and rewards, with fixed frequencies of the samples.
As an application example of my setting, assume I want to train a robot car that is driven by a human user. It has three distance sensors (left, front, right) and three controls (left, forward, right). I would record numpy arrays of observations and actions produced by the human user's interactions with the robot, as sketched below.
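Packaged into the imitation library's Trajectory type, one recorded episode might look roughly like this (a sketch with made-up values; note that obs needs one more entry than acts):

```python
import numpy as np
from imitation.data.types import Trajectory

# One made-up human-driven episode.
# Observations: (left, front, right) distance readings; one extra final obs.
obs = np.array([[0.9, 0.5, 0.9],
                [0.9, 0.3, 0.8],
                [0.7, 0.6, 0.9],
                [0.8, 0.7, 0.9]], dtype=np.float32)
# Actions: 0 = left, 1 = forward, 2 = right (one fewer entries than obs).
acts = np.array([1, 0, 1], dtype=np.int64)

demo = Trajectory(obs=obs, acts=acts, infos=None, terminal=True)
```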
For the training schedule, I would like the frequency with which each observation-action combination is trained to reflect, one to one, its frequency in the real-world data produced by the human driver.
I guess that the training schedule in Stable Baselines follows its own random-action and optimization plan, i.e. it is driven by reward optimization and probably will not correspond to the original data frequencies.
Maybe what I want to achieve is more of a pretraining step, but for evaluation purposes I want to train the network that way.
Would this also be possible with imitation learning?
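What I have in mind is roughly the following behavior-cloning sketch (assuming a recent version of the imitation library that uses gymnasium spaces; since bc.BC samples the recorded transitions directly, combinations that occur often in the human data should also be seen more often during training):

```python
import numpy as np
from gymnasium import spaces
from imitation.algorithms import bc
from imitation.data import rollout
from imitation.data.types import Trajectory

rng = np.random.default_rng(0)

# The human recordings, packaged as in the sketch above (tiny toy episode).
obs = np.array([[0.9, 0.5, 0.9],
                [0.9, 0.3, 0.8],
                [0.7, 0.6, 0.9],
                [0.8, 0.7, 0.9]], dtype=np.float32)
acts = np.array([1, 0, 1], dtype=np.int64)
trajectories = [Trajectory(obs=obs, acts=acts, infos=None, terminal=True)]

# Flatten the episode(s) into individual (obs, act) transitions;
# BC trains on these transitions directly, so the training frequencies
# reflect the frequencies in the recorded data.
transitions = rollout.flatten_trajectories(trajectories)

bc_trainer = bc.BC(
    observation_space=spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32),
    action_space=spaces.Discrete(3),  # left, forward, right
    demonstrations=transitions,
    rng=rng,
    batch_size=2,  # tiny batch only because this toy dataset is tiny
)
bc_trainer.train(n_epochs=10)
```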