
I tried to solve the OpenAI Gym mountain-car problem with my own Q-learning implementation.

After trying different things it started to work really well, but after a while (20k episodes * 1000 samples per episode) I noticed that the values stored in my Q-table grew so large in magnitude that they ended up stored as -inf.

During the simulation I used the following code:

for t in range(SAMPLE_PER_EPISODE):

    # apply the previously chosen action and store the reward in the R table
    observation, reward, done, info = env.step(action)
    R[state, action] = reward

    history.append((state, action, reward))

    # pick a greedy action from the Q-table row for `state`, breaking ties randomly
    max_indexes = np.argwhere(Q[state,] == np.amax(Q[state,])).flatten()
    action = np.random.choice(max_indexes)

For learning I used the following code after each episode:

#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma)

    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]

I got really good results with that algorithm, but the problem was, as explained above, that the Q-values went to -inf very quickly.

I think I had implemented the Q-learning update incorrectly, but after changing it to the following implementation, it doesn't work anymore (not nearly as well as it did before):

#train
latest_best = 0
total_reward = 0
for entry in reversed(history):
    # Here I changed the code
    Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])

    latest_best = np.max(Q[entry[0],:])
    total_reward += entry[2]

What am I doing wrong?

  • What values did you choose for `lr` and `gamma`? I suspect that they are too big and `Q[entry[0],entry[1]]` will just keep growing. Also, you might want to refactor your code so that you have variable names like `action` and `state` instead of `entry[...]`; it's hard to understand what the code actually does. – Sentry Jan 19 '18 at 14:45
  • lr = 0.99, gamma = 0.8. Shouldn't it stay bounded as long as these values are < 1? I will refactor the code in the example tomorrow on my laptop :) – greece57 Jan 19 '18 at 17:43

1 Answer


I think there are two problems with your code:

  1. First, your learning rate is probably too high (lr = 0.99 from your comment), and your discount factor (gamma = 0.8) might be too high as well.

The book Reinforcement Learning: An Introduction by Richard S. Sutton, one of the founding fathers of reinforcement learning, is available online, and I highly recommend you use it as a reference.

Q-Learning is a special case of Temporal Difference Learning and subchapter 6.2 mostly uses learning rates smaller than 0.15.

  2. Second, assuming that entry[0] is x_k, entry[1] is u_k and entry[2] is r_{k+1}, this line

     Q[entry[0],entry[1]] = Q[entry[0],entry[1]] + lr * (entry[2] + latest_best * gamma - Q[entry[0],entry[1]])
    

is equivalent to

    Q[x_k, u_k] = Q[x_k, u_k] + lr * (r_{k+1} + latest_best * gamma - Q[x_k, u_k])

If this is supposed to represent the Q-learning update

    Q[x_k, u_k] = Q[x_k, u_k] + lr * (r_{k+1} + gamma * max_u Q[x_{k+1}, u] - Q[x_k, u_k])

then there is a problem with your first version: without the -Q[x_k, u_k] term you basically keep summing up rewards that are only slightly discounted, so the stored values grow without bound. The second version with the additional -Q[x_k, u_k] should be correct.
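
To see why the first version blows up even though lr and gamma are both below 1, here is a small toy sketch (my own illustration, not code from the question): a single Q-value that receives a reward of -1 on every step, as Mountain Car does, updated once per step with each of the two rules. For simplicity the bootstrap term is the value itself, and lr = 0.1 is assumed here (gamma = 0.8 as in your comment).

    # toy comparison of the two update rules on a single Q-value
    reward = -1.0   # Mountain Car returns -1 per time step
    lr = 0.1        # assumed for illustration; smaller than the 0.99 from the question
    gamma = 0.8     # value from the question's comment

    q_without_td_error = 0.0   # first version:  Q += lr * (r + gamma * Q)
    q_with_td_error = 0.0      # second version: Q += lr * (r + gamma * Q - Q)

    for step in range(200):
        q_without_td_error += lr * (reward + gamma * q_without_td_error)
        q_with_td_error += lr * (reward + gamma * q_with_td_error - q_with_td_error)

    print(q_without_td_error)  # keeps growing in magnitude (roughly -6e6 after 200 steps)
    print(q_with_td_error)     # settles near reward / (1 - gamma) = -5.0

Without the -Q term, each update adds lr * (r + gamma * Q) on top of what is already stored, so the value compounds by a factor of (1 + lr * gamma) per step and drifts towards -inf; with it, the update has a fixed point at reward / (1 - gamma) and the table stays bounded.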

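Putting both points together, here is a sketch of how the training loop could look with the corrected update, a smaller learning rate, and the descriptive variable names suggested in the comments. It reuses Q, history, and numpy (np) from your existing code and is meant as an illustration of the structure, not a tuned drop-in replacement:

    lr = 0.1      # smaller learning rate, in the range used in the Sutton chapter above
    gamma = 0.8   # discount factor from your comment; tune as needed

    # train after each episode, walking the recorded history backwards
    latest_best = 0.0
    total_reward = 0
    for state, action, reward in reversed(history):
        # TD target: immediate reward plus the discounted best value of the successor state
        td_target = reward + gamma * latest_best
        Q[state, action] += lr * (td_target - Q[state, action])

        latest_best = np.max(Q[state, :])
        total_reward += reward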

Sentry