I am using the A2C (Advantage Actor-Critic) implementation from the stable-baselines3 package (package link here) to solve a reinforcement learning problem where the reward is +1 or 0. I have an automatic mechanism that assigns a reward to a choice in a given state, but it is not good enough at judging my choices. I have found that human judgement (a human sitting and rewarding the choices) works better.
Now I want to incorporate this human judgement into the A2C framework during training.
This is my understanding of how A2C works:
Let's say there are N timesteps in one episode. The trajectory is stored in a buffer (the rollout buffer, I believe): [(S1, A1, R1), (S2, A2, R2), ...], which is used to train the actor and critic neural networks at the end of the episode.
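To make it concrete, this is roughly the kind of hook I have in mind, written against the stable-baselines3 callback API. It is only a sketch: I am assuming the collected transitions are reachable as `model.rollout_buffer` from inside a callback, and I have not checked whether editing the rewards there actually changes what the networks are trained on (CartPole is just a stand-in environment).

```python
import numpy as np
from stable_baselines3 import A2C
from stable_baselines3.common.callbacks import BaseCallback


class InspectRolloutCallback(BaseCallback):
    """Sketch: inspect (and possibly overwrite) rewards after each rollout is collected."""

    def _on_step(self) -> bool:
        # Required abstract method of BaseCallback; nothing to do per environment step here.
        return True

    def _on_rollout_end(self) -> None:
        # Assumption: the collected transitions live in `self.model.rollout_buffer`,
        # with rewards stored as a numpy array of shape (n_steps, n_envs).
        buffer = self.model.rollout_buffer
        print("rewards in last rollout:", np.asarray(buffer.rewards).ravel())


model = A2C("MlpPolicy", "CartPole-v1", n_steps=5)
model.learn(total_timesteps=100, callback=InspectRolloutCallback())
```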
Can I access (and overwrite) this buffer that is fed to the neural networks for training? Or is there an alternative way to introduce a human in the loop in the A2C framework?
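One alternative I considered is to bypass the buffer entirely and wrap the environment, so that a human (or a stand-in scoring function) supplies the reward before A2C ever stores it. A rough sketch of what I mean, where `ask_human_for_reward` is a hypothetical placeholder for my manual judgement and the 4-tuple `step` signature assumes the classic gym API:

```python
import gym  # or gymnasium, depending on the stable-baselines3 version


def ask_human_for_reward(observation, action) -> float:
    """Hypothetical stand-in for a human rating the chosen action in this state."""
    # In practice this would present the choice to a person and return 1.0 or 0.0.
    return 0.0


class HumanRewardWrapper(gym.Wrapper):
    """Replace the environment's automatic reward with a human-provided one."""

    def step(self, action):
        # Assumes the classic 4-tuple gym step API; newer gymnasium returns 5 values.
        observation, _automatic_reward, done, info = self.env.step(action)
        reward = ask_human_for_reward(observation, action)
        return observation, reward, done, info


# Usage sketch:
# env = HumanRewardWrapper(gym.make("CartPole-v1"))
# model = A2C("MlpPolicy", env).learn(total_timesteps=1_000)
```

Is either of these directions reasonable, or is there a more standard way to do this with stable-baselines3?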