I'm new to reinforcement learning. Recently, I've been trying to train a Deep Q Network to solve OpenAI Gym's CartPole-v0, where solving means achieving an average score of at least 195.0 over 100 consecutive episodes.
I am using a two-layer neural network, experience replay with a memory capacity of 1 million experiences, an epsilon-greedy policy, the RMSProp optimizer, and the Huber loss function.
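For concreteness, here is a minimal sketch of that setup (written in PyTorch and assuming the older Gym API where `env.step` returns four values; the learning rate, hidden size, batch size, and epsilon schedule shown are illustrative stand-ins, not my exact values):

```python
import random
from collections import deque

import gym
import numpy as np
import torch
import torch.nn as nn

# Two-layer network mapping a state to one Q-value per action.
class QNet(nn.Module):
    def __init__(self, n_obs=4, n_act=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, hidden), nn.ReLU(),
            nn.Linear(hidden, n_act),
        )

    def forward(self, x):
        return self.net(x)

env = gym.make("CartPole-v0")
q_net = QNet()
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)  # lr is illustrative
loss_fn = nn.SmoothL1Loss()       # Huber loss
memory = deque(maxlen=1_000_000)  # replay memory holding up to 1M experiences
gamma, eps, batch_size = 0.99, 1.0, 64

for episode in range(500):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < eps:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = q_net(torch.tensor(state, dtype=torch.float32)).argmax().item()
        next_state, reward, done, _ = env.step(action)
        memory.append((state, action, reward, next_state, done))
        state = next_state

        # One gradient step per environment step, once the buffer is warm.
        if len(memory) >= batch_size:
            batch = random.sample(memory, batch_size)
            s, a, r, s2, d = map(np.array, zip(*batch))
            s = torch.tensor(s, dtype=torch.float32)
            a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
            r = torch.tensor(r, dtype=torch.float32)
            s2 = torch.tensor(s2, dtype=torch.float32)
            d = torch.tensor(d, dtype=torch.float32)

            q = q_net(s).gather(1, a).squeeze(1)
            with torch.no_grad():
                target = r + gamma * (1 - d) * q_net(s2).max(1).values
            loss = loss_fn(q, target)  # Huber loss on the TD error
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    eps = max(0.05, eps * 0.995)  # illustrative epsilon decay schedule
```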
With this setup, solving the task is taking tens of thousands of episodes (more than 30k), and learning is also quite unstable at times. Is it normal for a Deep Q Network to oscillate and take this long to learn a task like this? What alternatives (or improvements to my DQN) could give better results?