Simulation parameters and reward calculation in benchmark scenario "Merge"

Question

I am currently trying to reproduce some results from your previous papers on my installation of Flow. I ran into the following questions, because I am not clear about the exact parameters used in the experiments and the results given in the papers.

For [1], I expected to be able to reproduce the results by running stabilizing_highway.py from your repo (at commit "bc44b21"; I also tried the current version, but could not find differences related to my questions). I expected the merge scenario to be the same one used in [2].

The differences I have already found between the papers and the code are:

1) The reward function in [2] (2) is different from the one in [1] (6): the first uses a max and a normalizing term in the first part of the sum. Why this difference? Looking at the code, I interpret it as follows: depending on the evaluate flag, you compute the reward either (a) as the average speed over all vehicles in the simulation or (b) as the function given in [2] (without the normalizing term on the speed), but with alpha (eta2 in the code) = 0.1 (see merge.py, line 167, compute_reward). I could not find the value of alpha in the papers, so I assume the code version was used?

2) I further read the code as computing the reward by iterating over ALL vehicles in the simulation, not just the observed ones. This seems counterintuitive to me: a reward function in a partially observed environment trains the agent using information from the fully observed state!?

3) This leads to the next question: you eventually want to evaluate the reward computed when the evaluate flag is set, namely the average speed of all vehicles in the simulation, as given in Table 1 of [1]. Are these values calculated by averaging over the "speed" column in the emissions.csv file that the visualizer tool can produce?

4) The next question concerns the cumulative return in the figures of [1] and [2]. In [1], Figure 3, the cumulative returns in the merge scenarios peak at around 500, while the maximum values in [2], Figure 5 are around 200000. Why this difference? Is it due to the different reward functions? Could you please provide the alpha values for both and confirm which version is correct (paper or code)?

5) Something I also observe in [1], Table 1, Merge1&2: ES clearly has the highest average speed, but TRPO and PPO have a better cumulative return. Does this suggest that the 40 rollouts for evaluation were not enough to get a representative mean value? Or that maximizing the training reward function does not necessarily give good evaluation results?

6) Some other parameters are unclear to me: [1], Fig. 3 mentions 50 rollouts, while N_ROLLOUTS=20 in the code. What do you recommend using? In [1], A.2 Merge, T=400, while HORIZON=600, and [2], C. Simulations talks about 3600s. In a replay in SUMO produced by running visualizer_rllib.py, the simulation terminates at time 120.40, which would match the HORIZON of 600 with time steps of 0.2s (this information is given in [2]). So I assume that for this scenario the horizon should be set much higher than in both [1] and the code, namely to 18,000?
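
The arithmetic in point 6 can be checked with a small sketch. The helper names are hypothetical and not part of Flow; only the numbers (HORIZON=600, 0.2s steps, 3600s) are taken from the code and papers quoted above.

```python
def simulated_seconds(horizon_steps, sim_step):
    """Simulated time covered by a rollout of `horizon_steps` steps."""
    return horizon_steps * sim_step


def steps_for_duration(duration_s, sim_step):
    """Number of steps needed to cover `duration_s` seconds of simulation."""
    return round(duration_s / sim_step)


# HORIZON=600 with 0.2 s steps covers about 120 s,
# close to the 120.40 observed in the replay.
print(simulated_seconds(600, 0.2))

# Covering the 3600 s mentioned in [2] at 0.2 s per step needs 18000 steps.
print(steps_for_duration(3600, 0.2))
```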

Thanks for any hints! KR M

[1] Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, F., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409).

[2] Kreidieh, A. R., Wu, C., & Bayen, A. M. (2018). Dissipating stop-and-go waves in closed and open networks via deep reinforcement learning. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC) (pp. 1475-1480). IEEE.

And I will tell a fairy tale .... – Mikhail Zhuikov, Jun 18 '19 at 13:42

score 1 · Accepted Answer · answered Jul 02 '19 at 20:26

Apologies for the delay in the answer.

1) The version in the code is the one that was used. Paper [1] was written after paper [2] (despite one being published earlier), and we added a normalizing term to help standardize the learning rate across problems. The reward function is the one in the codebase: when the evaluate flag is true, it computes the actual traffic statistic (i.e. speed), whereas when it is false, it computes the reward function we use at train time.
2) As you point out, using all of the vehicles in the reward function is a bad assumption; we obviously do not have access to all of that data (though you could imagine we are able to read it out through an induction loop). Future work will focus on removing this assumption.
3) You can do it this way. However, we just calculate it by running the experiment with the trained policy, storing all the vehicle speeds at each step, and then computing the result at the end of the experiment.
4) Unfortunately, both versions are "correct". As you point out, the difference comes from the addition of the "eta" term in [2] and the normalization in [1].
5) It's as you say: the training reward function is not the same as the test reward function, so an algorithm that does well with the evaluate flag off may not do as well with the evaluate flag on. This is a choice we made, to have separate training and testing functions. You're welcome to use the testing function at train time!
6) Both should work; I suspect the N=20 in the codebase crept in over time as people found that 50 was not necessary for that scenario. However, every RL algorithm does better with more rollouts, so setting N=50 won't hurt. As for the horizon, as far as I can tell from the codebase, the sim_step is 0.5 and the horizon is 750, so the experiment should run for 375 seconds.
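
To make the answers to questions 1, 3 and 6 concrete, here is a minimal sketch in plain Python. It is not Flow's code: the function names and toy speeds are hypothetical, and the training reward is passed in as an opaque number. Whether the final average pools all stored speeds or averages the per-step means is an assumption, since the answer above does not say; in an open network the number of vehicles changes between steps, so the two can differ.

```python
from statistics import mean


def reward(speeds, training_reward, evaluate):
    """Mimic the evaluate switch: mean speed if set, else the train-time reward."""
    if evaluate:
        return mean(speeds) if speeds else 0.0
    return training_reward


def pooled_mean_speed(speeds_per_step):
    """Store every vehicle speed at every step, then average once at the end."""
    all_speeds = [v for step in speeds_per_step for v in step]
    return mean(all_speeds) if all_speeds else 0.0


def mean_of_step_means(speeds_per_step):
    """Alternative (assumption): average the per-step mean speeds instead."""
    step_means = [mean(step) for step in speeds_per_step if step]
    return mean(step_means) if step_means else 0.0


# Toy rollout: two vehicles in the first step, one in the second.
rollout = [[10.0, 20.0], [30.0]]
print(pooled_mean_speed(rollout))   # 20.0
print(mean_of_step_means(rollout))  # 22.5

# Duration implied by the numbers in the answer: horizon 750 at sim_step 0.5.
print(750 * 0.5)  # 375.0
```

Averaging the "speed" column of emissions.csv would correspond to the pooled version, assuming the file holds one row per vehicle per step.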

If you have more questions, please email the corresponding author (me)! I'd love to help you work through this in more detail.
