
The environment is a directed graph whose nodes have their own "goodness" values (marked green) and whose edges have prices (marked red). The environment also has a price constraint P. The goal is to accumulate as many "goodness" points from nodes as possible while making a cycle (for example 0-->6-->5-->0) and not exceeding the price constraint.
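
For illustration, the environment could be represented roughly like this (a minimal sketch in Python; the goodness and price numbers below are placeholders chosen to be consistent with the example paths in the text, not the actual values from the figure):

    # Illustrative environment structure (placeholder values, not the
    # real numbers from the figure).
    NODE_GOODNESS = {0: 0, 1: 2, 2: 3, 3: 1, 4: 4, 5: 2, 6: 5}   # green values
    EDGE_PRICE = {                                               # red values
        (0, 1): 3, (1, 2): 2, (2, 3): 4, (3, 4): 2, (4, 5): 2,
        (5, 0): 3, (0, 6): 4, (6, 5): 3,
    }
    PRICE_LIMIT = 13   # the hard constraint P
    START_NODE = 0

    def neighbors(node):
        """Nodes reachable from `node` via a directed edge."""
        return [v for (u, v) in EDGE_PRICE if u == node]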

I managed to implement the Q-Learning algorithm when there are no constraints, but I don't fully understand how to add hard constraints while approximating the Q-function.

For instance, the starting point is 0 and the price limit is 13. Taking the path 0-->1-->2-->3-->4-->5-->0 would not be a valid choice for the agent, because the price limit (13) is already reached at node 5, so the agent should be punished for violating the constraint. However, taking the path 0-->6-->5-->0 is a correct choice, and the agent should be rewarded for it. What I do not understand is how to tell the agent that going from 5 to 0 is sometimes a perfect choice and sometimes not applicable, because the constraint has already been violated. I tried giving a huge penalty when the price constraint was violated and ending the episode immediately, but that did not seem to work out; a rough sketch of that attempt is shown below.
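
The penalty-and-terminate attempt looked roughly like this (a simplified tabular sketch, not the exact code; it reuses the illustrative `NODE_GOODNESS`, `EDGE_PRICE`, `PRICE_LIMIT`, `START_NODE` and `neighbors` from the snippet above, and the hyperparameters are placeholders):

    import random
    from collections import defaultdict

    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.2   # placeholder hyperparameters
    VIOLATION_PENALTY = -100.0               # "huge penalty" for breaking the limit

    # Q[(node, next_node)] -> value. The state is the current node only,
    # which is exactly why the agent cannot tell whether 5 --> 0 is still allowed.
    Q = defaultdict(float)

    def run_episode():
        node, spent = START_NODE, 0
        while True:
            actions = neighbors(node)
            if random.random() < EPSILON:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(node, a)])

            spent += EDGE_PRICE[(node, action)]
            if spent > PRICE_LIMIT:
                # Constraint violated: huge negative reward, end episode immediately.
                reward, done = VIOLATION_PENALTY, True
            else:
                reward = NODE_GOODNESS[action]
                done = (action == START_NODE)   # cycle closed

            best_next = 0.0 if done else max(Q[(action, a)] for a in neighbors(action))
            Q[(node, action)] += ALPHA * (reward + GAMMA * best_next - Q[(node, action)])

            if done:
                return
            node = action

    for _ in range(10_000):
        run_episode()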

My question(s):

  • How to add hard constraints to RL algorithms like Q-Learning?

  • Is Q-Learning a valid choice for that kind of problem?

  • Should another algorithm, such as Monte Carlo Tree Search, be chosen instead of Q-Learning?

I assume this is a very common problem in real-world scenarios, but I couldn't find any examples of it.

[Figure: the example graph, with node goodness values in green and edge prices in red]

Benas.M
  • Is this an episodic game, as in the agent fails if it overspends and then resets at the beginning? And does the agent also reset its position when it completes the correct 0-6-5-0 trajectory, or does it just keep going? What do your current observation space and reward function look like for the working game? – MarcusRenshaw May 30 '20 at 18:57

0 Answers