How can we design rewards for an RL algorithm to incentivize a group metric?

Question

I am designing a reinforcement learning agent to guide individual cars within a bounded area of roads. The policy determines which route each car should take.

Each car can see the cars within 10 miles of it, their velocities, and the road graph of the whole bounded area. The policy of the RL-based agent must determine the actions of the cars in order to maximize the flow of traffic, let's say defined as reduced congestion.

How can we design rewards to incentivize each car to not act greedily and maximize just its own speed, but rather minimize congestion within the bounded area overall?

I tried writing a Q-learning based method for routing each vehicle, but this ended up compelling every car to greedily take the shortest route, producing a lot of congestion by crowding the cars together.

For example, how do we even design a reward function that tells each agent to optimize for the collective good, not be selfish? — dangerChihuahua007, Oct 31 '22 at 09:04

score 2 · Answer 1 · answered Nov 02 '22 at 13:34

It's good to see more people working on cooperative MARL. Shameless plug for my research effort, feel free to reach out to discuss.

I think you need to take a step back from your question. You ask how to design the rewards so the agents will benefit the environment rather than themselves. Now, if you wanted, you could just give each agent a reward based on the total welfare of the population. This will probably work, but you probably don't want that because it defeats the purpose of a multi-agent environment, right?
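For concreteness, the shared-reward option could look roughly like the sketch below. The names `shared_team_reward`, `speeds`, and `free_flow_speed` are made up for illustration, and using mean speed relative to free-flow speed as the welfare measure is an assumption, not something from the question.

```python
from typing import Dict


def shared_team_reward(speeds: Dict[str, float], free_flow_speed: float) -> Dict[str, float]:
    """Give every car the same reward: the fleet's mean speed as a
    fraction of the free-flow speed (an assumed proxy for low congestion)."""
    if not speeds:
        return {}
    team = sum(speeds.values()) / (len(speeds) * free_flow_speed)
    # Every agent receives the identical team-level signal.
    return {car_id: team for car_id in speeds}


# Hypothetical example: two cars, one crawling, one at free flow.
rewards = shared_team_reward({"car_a": 30.0, "car_b": 60.0}, free_flow_speed=60.0)
```

Because every car gets the same number, no single car can raise its reward by speeding up at the expense of the others.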

If you want the agents to be selfish but somehow converge to a cooperative solution, this is a very difficult problem (which is what I'm working on).

If you're okay with a compromise, you could use intrinsic motivation, like in these papers:

What all of these papers have in common is that they add another component to the reward of each agent. That component is pro-social, for example incentivizing the agent to increase its influence over the actions of other agents. Still, it's a less extreme solution than making the reward the social welfare directly.
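As a rough sketch of the general shape (not the method of any particular paper), each agent keeps its own reward and gets a weighted pro-social bonus on top. Here the bonus is simply the mean reward of the other agents, which is a stand-in I picked for illustration; an influence-based term would replace it. The names `prosocial_rewards` and `beta` are hypothetical.

```python
from typing import Dict


def prosocial_rewards(own: Dict[str, float], beta: float) -> Dict[str, float]:
    """Shaped reward: r_i = own_i + beta * mean(own_j for j != i).

    beta = 0 recovers fully selfish agents; a larger beta weights the
    pro-social term more heavily. The mean-of-others bonus is an assumed
    placeholder for whatever intrinsic term is actually used.
    """
    n = len(own)
    total = sum(own.values())
    shaped = {}
    for agent, r in own.items():
        others_mean = (total - r) / (n - 1) if n > 1 else 0.0
        shaped[agent] = r + beta * others_mean
    return shaped


# Hypothetical per-car rewards, e.g. normalized speeds.
shaped = prosocial_rewards({"car_a": 1.0, "car_b": 0.0, "car_c": 0.5}, beta=0.5)
```

The weight `beta` is the knob for the compromise: it lets you move between selfish rewards and something closer to shared welfare without committing to either extreme.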
