When I run DQN and evaluate the performance of the learned policy, it often fluctuates heavily.
It is also not difficult to find performance plots like this online:
[Figure: a graph showing large fluctuations in performance] (1) (2)
I am quite confused about why this happens. Doesn't DQN move from one local optimum to the next, higher one as exploration goes on? If so, why does it degrade to a lower optimum?