
This question relates to a previous question about imitation learning:

train stable baselines 3 with examples?

I want to create a very simple venv and/or policy to simulate the expert behavior. Specifically, I want to create an expert that corresponds directly to my observations and actions.

In the source code of generate_trajectories it says: "2) A Callable that takes an ndarray of observations and returns an ndarray of corresponding actions."

And the function additionally needs an environment (venv).

So it seems the training will sample actions and get observations and rewards from the callable and/or the venv.
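To make this concrete, here is a minimal sketch (my own assumption, not code from the imitation library) of what such a callable expert could look like for my recorded data; the file names and the nearest-neighbour lookup are just placeholders:

```python
import numpy as np

# Recorded human-driver data (hypothetical file names):
# obs has shape (N, 3)  -> left, front, right distance readings
# acts has shape (N,)   -> 0 = left, 1 = forward, 2 = right
recorded_obs = np.load("human_obs.npy")
recorded_acts = np.load("human_acts.npy")

def expert_policy(obs: np.ndarray) -> np.ndarray:
    """Return actions for a batch of observations.

    Uses a nearest-neighbour lookup into the recorded observations,
    so the "expert" replays what the human did in the most similar
    recorded situation.
    """
    # obs arrives batched, e.g. shape (n_envs, 3), because the
    # trajectory generation works on a vectorized environment.
    dists = np.linalg.norm(recorded_obs[None, :, :] - obs[:, None, :], axis=-1)
    return recorded_acts[dists.argmin(axis=1)]
```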

Now, what I am trying to achieve is to train on fixed tuples of observations, actions and rewards, with fixed frequencies of the samples.

As an application example of my setting, assume I want to train a robot car which is driven by a human user. It would have three distance sensors (left, front, right) and three controls (left, forward, right). I would try to record numpy arrays of observations and actions produced from the human user's interactions with the robot.
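For illustration, this is roughly the environment I have in mind; the observation and action spaces are my own assumption for the three sensors and three controls, and the dynamics are just placeholders (old-style gym API, wrapped as a venv for stable baselines):

```python
import gym
import numpy as np
from gym import spaces
from stable_baselines3.common.vec_env import DummyVecEnv

class RobotCarEnv(gym.Env):
    """Toy robot-car environment (sketch only)."""

    def __init__(self):
        # Three distance sensors: left, front, right (arbitrary max range 1.0).
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)
        # Three controls: 0 = left, 1 = forward, 2 = right.
        self.action_space = spaces.Discrete(3)

    def reset(self):
        # Dummy start observation; a real env would read the sensors here.
        return np.array([0.5, 1.0, 0.5], dtype=np.float32)

    def step(self, action):
        # Placeholder dynamics: random sensor readings and zero reward,
        # since I mainly need the env as a container for the spaces.
        obs = self.observation_space.sample()
        return obs, 0.0, False, {}

venv = DummyVecEnv([RobotCarEnv])
```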

For the training schedule, I would like the training frequency of the observation-action combinations to reflect, 1:1, the frequency of the real-world data produced by the human driver.

I guess that the training schedule from stable baselines will follow its own random action and optimization plan, which means it is driven by reward optimization and will probably not correspond to the original data frequencies.

Maybe what I want to achieve is more like pretraining, but for evaluation purposes I want to train the net that way.
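What I mean by pretraining would be something like behavior cloning on the recorded tuples, so every recorded sample is used with exactly its real-world frequency. The following is only a sketch based on my reading of the imitation docs; the Transitions fields and BC constructor arguments may differ between versions, and the file names are placeholders:

```python
import numpy as np
from gym import spaces
from imitation.algorithms import bc
from imitation.data.types import Transitions

# Recorded human-driver data again (hypothetical file names).
obs = np.load("human_obs.npy").astype(np.float32)   # (N, 3)
acts = np.load("human_acts.npy").astype(np.int64)   # (N,)
next_obs = np.roll(obs, -1, axis=0)                 # crude next-obs placeholder
dones = np.zeros(len(obs), dtype=bool)
infos = np.array([{}] * len(obs))

transitions = Transitions(obs=obs, acts=acts, next_obs=next_obs,
                          dones=dones, infos=infos)

bc_trainer = bc.BC(
    observation_space=spaces.Box(0.0, 1.0, shape=(3,), dtype=np.float32),
    action_space=spaces.Discrete(3),
    demonstrations=transitions,
    rng=np.random.default_rng(0),  # required in recent imitation versions; drop for older ones
)
bc_trainer.train(n_epochs=10)
```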

Would this also be possible with imitation learning?

Mike75
  • I don't think I understand the question about the training schedule. If you use imitation, the policy WILL be trained based on the expert data. You can read about how it's done from the linked papers (https://github.com/HumanCompatibleAI/imitation#imitation-learning-baseline-implementations). The basic approach is to train the classifier to predict an expert's behavior. So yes, it will start random but should converge to expert actions. – Bhupen Jul 20 '22 at 16:20
  • Thank you, this is all interesting; it seems the paper on Maximum Causal Entropy Optimization is very helpful to understand the consequences of different schedules. "Many IOC approaches (Abbeel & Ng, 2004; Ziebart et al., 2008) consider cost functions linear in a set of features and attempt to find behaviors that induce the same feature counts as the policy to be mimicked (E[Σ_t f_{S_t}] = Ẽ[Σ_t f_{S_t}]); by linearity such behaviors achieve the same expected value." – Mike75 Jul 20 '22 at 21:16
  • Yeah, sounds interesting. I don't know if you want to share, but what is your environment? – Bhupen Jul 21 '22 at 14:32
  • In fact it is a very simple experimental question: I use a robot (like the one in the question): it can drive forward, left and right, and has three distance sensors: half-left (45°), front and half-right (45°). A human driver controls the car for short sequences of actions, e.g. if it goes forward and reaches a wall, it turns left or right. Then I have 10-20 typical sequences and want a network to be trained with these. Afterwards the imitation net should drive, and I want to investigate its overall behavior (also in novel situations not trained before). – Mike75 Jul 22 '22 at 05:50
  • I guess there are two separate processes (as the code from the imitation tutorial shows): (1) an expert has to be trained, (2) the TRAINED expert can be imitated. My main point in this question now is: how can I train an expert in step (1) just from a few obs-action sequences? Do I have to create trajectories, or do I simply create an env which gives back random observations when an action is executed in the learning process controlled by stable baselines? – Mike75 Jul 22 '22 at 10:53
  • I'm even a step further back: I want to generate a dataset using gym-retro of me playing, and then use those trajectories as human expert data, and at this moment I don't know how to do that. – aletelecomm Oct 26 '22 at 00:24

0 Answers