I am currently trying to reproduce some results from your previous papers on my installation of flow. I ran into the following questions, where the exact parameters used in the experiments and the results given in the papers are not clear to me.
For [1], I expected to be able to reproduce the results by running stabilizing_highway.py from your repo (at commit "bc44b21"; I also tried the current version, but could not find differences related to my questions). I expected the merge scenario to be the same as the one used in [2].
The differences I have found so far between the papers and the code are:
1) The reward function in [2] (2) differs from the one in [1] (6): the former uses a max and a normalization in the first term of the sum. Why this difference? Looking at the code, I interpret it as follows: depending on the evaluate flag, you compute either (a) the reward as the average speed over all vehicles in the simulation, or (b) the function given in [2] (without the normalizing term on the speed), but with alpha (eta2 in the code) = 0.1 (see merge.py, line 167, compute_reward). I could not find the value of alpha in the papers, so I assume the value from the code was used?
2) I further read the code as computing the reward by iterating over ALL vehicles in the simulation, not just the observed ones. This seems counterintuitive to me: the agent is trained in a partially observed environment, yet its reward function uses information from the fully observed state?
3) This leads to the next question: you eventually want to evaluate the reward as computed when the evaluate flag is set, namely the average speed of all vehicles in the simulation, as given in Table 1 of [1]. Are these values calculated by averaging over the "speed" column in the emissions.csv file that can be produced by running the visualizer tool?
4) The next question concerns the cumulative return in the figures of [1] and [2]. In [1], Figure 3, the cumulative returns in the merge scenarios reach a maximum of around 500, while the maximum values in [2], Figure 5, are around 200000. Why this difference? Is it due to the different reward functions used? Could you please provide the alpha values for both and confirm which version is correct (paper or code)?
5) I also observe in [1], Table 1, Merge1&2, that ES clearly has the highest average speed, but TRPO and PPO have a better cumulative return. Does this suggest that the 40 rollouts used for evaluation were not enough to get a representative mean value? Or that maximizing the training reward function does not necessarily give good evaluation results?
6) Some other parameters are unclear to me: [1], Fig. 3, mentions 50 rollouts, while N_ROLLOUTS=20 in the code. Which do you recommend? In [1], A.2 Merge, T=400, while HORIZON=600 in the code, and [2], C. Simulations, talks about 3600s. In a SUMO replay produced by running visualizer_rllib.py, the simulation terminates at time 120.40, which would match a HORIZON of 600 with time steps of 0.2s (this step size is given in [2]). So I assume that for this scenario the horizon should be set much higher than in both [1] and the code, namely to 18,000 steps (3600s / 0.2s)?
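To make the checks in 3) and 6) concrete, here is a minimal sketch of how I would compute them. It assumes that emissions.csv has a header row with a "speed" column (one row per vehicle per time step) and that the simulation step is 0.2s as given in [2]. The function names are my own and are not part of flow.

```python
import csv


def mean_speed(emissions_path):
    """Average the "speed" column over all rows of an emissions CSV.

    Assumption: the file has a header row containing "speed" and each
    row is one vehicle at one time step, so this is a mean over all
    vehicle-steps, not a mean of per-vehicle means.
    """
    total, count = 0.0, 0
    with open(emissions_path, newline="") as f:
        for row in csv.DictReader(f):
            total += float(row["speed"])
            count += 1
    if count == 0:
        raise ValueError("no rows found in " + emissions_path)
    return total / count


def horizon_steps(duration_s, sim_step_s=0.2):
    """Number of simulation steps needed to cover duration_s seconds."""
    return round(duration_s / sim_step_s)


def simulated_seconds(horizon, sim_step_s=0.2):
    """Simulated time covered by a rollout of `horizon` steps."""
    return horizon * sim_step_s


if __name__ == "__main__":
    # HORIZON=600 at 0.2s per step covers about 120s of simulation.
    print(simulated_seconds(600))
    # Covering the 3600s mentioned in [2] would need this many steps.
    print(horizon_steps(3600))
```

If the values in Table 1 of [1] were computed differently (for example as a mean of per-rollout means), the averaging in mean_speed would need to change accordingly.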
Thanks for any hints! KR M
[1] Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, F., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409)
[2] Kreidieh, Abdul Rahman, Cathy Wu, and Alexandre M. Bayen. "Dissipating stop-and-go waves in closed and open networks via deep reinforcement learning." In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 1475-1480. IEEE, 2018.