I am new to Machine Learning, and I am trying to solve MountainCar-v0 using Q-learning. I can solve the problem now, but I am still confused.
According to the MountainCar-v0's Wiki, the reward remains -1 for every step, even if the car has reached the destination. How does the invariant reward help the agent learn? If every step gives the same reward, how can the agent tell if it is a good move or a bad move?
Thanks in advance!