
I am using the A2C (Advantage Actor Critic) implementation from the stable-baselines3 package (package link here) to solve a reinforcement learning problem where the reward is +1 or 0. I have an automatic mechanism that assigns a reward to a choice in a given state, but it is not good enough at rewarding my choices. I have found that human judgement (a human sitting down and rewarding the choices) works better.

Now, I want to incorporate this human judgement into the A2C framework during training.

This is my understanding of how A2C works:

Let's say there are N timesteps in one episode. The trajectory [(S1, A1, R1), (S2, A2, R2), ...] is stored in an experience replay buffer, which is used to train the actor and critic neural networks at the end of the episode.
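For reference, my setup looks roughly like the following (the environment id, the hyperparameters, and the gym import are placeholders, not my actual values):

```python
import gym  # or gymnasium, depending on the stable-baselines3 version
from stable_baselines3 import A2C

# "MyEnv-v0" stands in for my custom environment with the automatic reward.
env = gym.make("MyEnv-v0")

# Standard stable-baselines3 A2C usage; n_steps controls how many transitions
# are collected into the rollout buffer before each actor/critic update.
model = A2C("MlpPolicy", env, n_steps=5, verbose=1)
model.learn(total_timesteps=100_000)
```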

Can I access the buffer that is fed to the neural networks for training? Or is there another way to introduce a human in the loop within the A2C framework?

Prasanjit Rath

1 Answer


Of course! The environment is just a Python script: somewhere at the end of env.step, the reward is computed and returned, and it is then stored, together with the state and the action, in the buffer.

You could therefore supply the reward manually each time an action is taken, using simple I/O commands, as in the sketch below.
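Here is a rough, untested sketch of that idea as an environment wrapper (the environment id is a placeholder, and it assumes the older gym API where step returns four values):

```python
import gym
from stable_baselines3 import A2C


class HumanRewardWrapper(gym.Wrapper):
    """Replaces the environment's automatic reward with one typed in by a human."""

    def step(self, action):
        obs, auto_reward, done, info = self.env.step(action)

        # Show the human whatever they need to judge this transition.
        print(f"Action taken: {action}")
        print(f"Automatic reward would have been: {auto_reward}")

        # Simple blocking console I/O: the human types 0 or 1.
        human_reward = float(input("Enter reward (0 or 1): "))
        return obs, human_reward, done, info


# "MyEnv-v0" is a placeholder for your own environment.
env = HumanRewardWrapper(gym.make("MyEnv-v0"))
model = A2C("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```

The same idea works if you would rather edit the reward computation directly inside your environment's step method instead of wrapping it.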

However, deep reinforcement learning usually needs hundreds of thousands of timesteps of experience before it learns anything useful (unless the environment is simple enough), so having a human hand-label the reward for every single step may quickly become impractical.