I want to know how to add a constraint to Q-learning. Each action results in two reward signals (reward 1 = delivery cost, reward 2 = delivery time). I want to minimize the cost while ensuring the maximum delivery time limit is not violated. Is there a standard/formalized way to do this?
1 Answer
The easiest solution would be to create a single reward function that takes both of those signals into account.
To minimize delivery costs, you'd want to start by defining your reward function like:
R(.) = -delivery_cost
The negation is there because Reinforcement Learning is typically about rewards which should be maximized, instead of costs which should be minimized.
A straightforward way to have the agent learn not to violate a delivery time limit would be to subtract a massive constant from the reward if the delivery time limit is violated, and not add or subtract anything if it isn't violated. So, that would look something like:
R(.) = -delivery_cost - M    if delivery_time > constraint
R(.) = -delivery_cost        otherwise
The value of M would have to be something really big. How big "really big" is depends on how large you expect the delivery costs can become, because M has to be bigger than that.
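A minimal sketch of this piecewise reward in Python. The names `delivery_cost`, `delivery_time`, `time_limit`, and `M` are illustrative placeholders, not part of any particular library:

```python
def reward(delivery_cost, delivery_time, time_limit=24.0, M=1e6):
    """Return -cost, minus a large penalty M when the time limit is violated."""
    r = -delivery_cost
    if delivery_time > time_limit:
        # M must exceed any plausible delivery cost, so that violating
        # the constraint is always worse than any feasible delivery.
        r -= M
    return r
```

You'd call this once per step/episode with the observed cost and time, e.g. `reward(50.0, 20.0)` gives `-50.0`, while a late delivery like `reward(50.0, 30.0)` gives `-50.0 - 1e6`.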
Of course it is also possible to create a smoother reward function than this, especially if you want to allow violating the delivery time constraint by a little bit if that means you get a significant reduction in costs.
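One simple smoother variant, sketched here as an assumption rather than a standard recipe, is to penalize in proportion to how far the time limit is overshot; `penalty_rate` is a hypothetical tuning parameter:

```python
def smooth_reward(delivery_cost, delivery_time, time_limit=24.0, penalty_rate=100.0):
    """Penalize time-limit violations proportionally to the overshoot."""
    overshoot = max(0.0, delivery_time - time_limit)
    return -delivery_cost - penalty_rate * overshoot
```

With this shape, a small violation that saves a lot of cost can still be worth taking, which matches the trade-off described above; a larger `penalty_rate` pushes the agent back toward treating the limit as hard.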
If you want to look into significantly more complex solutions than what I proposed above, you'll want to look around for literature on Multi-Objective Reinforcement Learning.

- Hi Dennis, that's a good answer and very helpful. I will try your approach and see if the model can converge at some point through learning. – Jerry Nov 30 '17 at 15:16
- Could you provide me with relevant links regarding Multi-Objective Reinforcement Learning? I have a more complex version of the model to extend. Basically, I have several input parameters (i.e. number of trucks, the minimal loading volume requirement, cost per trip) and several output parameters (e.g. delivery cost, delivery time, truck utilisation rate, shipping frequency, with some KPIs where a bigger value means better performance and others the opposite). I would like to understand how to enable the model to learn from rewards that are different in nature. – Jerry Nov 30 '17 at 15:24
- BTW, is there a standard and mature algorithm for Multi-Objective Reinforcement Learning? I saw there are a great many in the literature and I don't know which one is credible. – Jerry Nov 30 '17 at 15:29
- I've never personally done any real work in Multi-Objective RL, so I can't really give any solid personal recommendations other than my knowledge that that's basically the most commonly used name for the problem you have. "A Survey of Multi-Objective Sequential Decision-Making" appears to be a good survey on the topic, but it's from 2013, so it won't include new work from the most recent years. A PDF of that one is available if you just google it. – Dennis Soemers Nov 30 '17 at 15:43