I tried to solve the OpenAI Gym MountainCar problem with my own Q-learning implementation.
After experimenting with different settings it started to work really well, but after a while (20k episodes * 1000 samples per episode) I noticed that the values stored in my Q-table grew too large and eventually became -inf.
During the simulation I used the following code:
for t in range(SAMPLE_PER_EPISODE):
    observation, reward, done, info = env.step(action)
    R[state, action] = reward
    history.append((state, action, reward))
    # greedy action selection, breaking ties randomly
    max_indexes = np.argwhere(Q[state,] == np.amax(Q[state,])).flatten()
    action = np.random.choice(max_indexes)
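Not shown in the loop above is how I turn the continuous observation (position, velocity) into the discrete state index for the Q-table. It is roughly along these lines (the bin count and the exact binning formula here are placeholders, not my real code):

import gym
import numpy as np

env = gym.make("MountainCar-v0")

N_BINS = 40  # placeholder resolution, not my actual value
obs_low = env.observation_space.low
obs_high = env.observation_space.high

def to_state(observation):
    # map the continuous (position, velocity) observation to a single table index
    ratios = (observation - obs_low) / (obs_high - obs_low)
    bins = np.minimum((ratios * N_BINS).astype(int), N_BINS - 1)
    return int(bins[0] * N_BINS + bins[1])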
For learning I used the following code after each episode:
#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # entry = (state, action, reward); latest_best is the max Q-value of the state visited one step later
    Q[entry[0], entry[1]] = Q[entry[0], entry[1]] + lr * (entry[2] + latest_best * gamma)
    latest_best = np.max(Q[entry[0], :])
    total_reward += entry[2]
I got really good results with that algorithm, but the problem was, as explained above, that the Q-values very quickly diverged to -inf.
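To illustrate what I mean, here is a toy check that isolates just that update rule on a single Q-value with the constant -1 per-step reward from MountainCar (this is not my training loop, and the constants are placeholders, just the same arithmetic):

lr, gamma, reward = 0.1, 0.9, -1.0  # placeholder constants
q = 0.0
for step in range(5000):
    # my original update: the old value is never subtracted, so it keeps compounding
    q = q + lr * (reward + gamma * q)
print(q)  # already a huge negative number; with more steps it overflows to -inf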
I think I implemented the Q-learning algorithm incorrectly, but after changing it to the following implementation, it no longer works anywhere near as well as it did before:
#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # Here I changed the code: the old Q-value is now subtracted inside the update
    Q[entry[0], entry[1]] = Q[entry[0], entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0], entry[1]])
    latest_best = np.max(Q[entry[0], :])
    total_reward += entry[2]
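For reference, my understanding of the textbook tabular Q-learning update (which is what the changed line above is supposed to correspond to, written as a per-step function) is roughly this:

import numpy as np

def q_update(Q, state, action, reward, next_state, lr, gamma):
    # standard temporal-difference update: move Q towards reward + discounted best next value
    td_target = reward + gamma * np.max(Q[next_state, :])
    Q[state, action] += lr * (td_target - Q[state, action])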
What am I doing wrong?