I am using designing a reinforcement learning agent to guide individual cars within a bounded area of roads. The policy determines which route the car should take.
Each car can see the cars within 10 miles of it, their velocities, and the road graph of the whole bounded area. The policy of the RL-based agent must determine the actions of the cars in order to maximize the flow of traffic, lets say defined by reduced congestion.
How can we design rewards to incentivize each car to not act greedily and maximize just its own speed, but rather minimize congestion within the bounded area overall?
I tried writing a Q-learning based method for routing each vehicle, but this ended up compelling every car to greedily take the shortest route, producing a lot of congestion by crowding the cars together.