
I am experimenting with Q-learning using a Super Mario Bros Gym environment. I am trying to retrieve the best possible action using np.argmax, which should return something between 0 and 11, but it is returning values like 224440. It only returns values like this sometimes, and it seems to be doing it more frequently as the program goes on.

I have tried logging the shape of the actions to see if I am making a mistake somewhere else, and I have printed almost every value to see if something is being set improperly, but I can't seem to find anything.

Currently I'm catching these improper actions so they don't crash the program and replacing them with a random action. This is obviously not a solution, but it helps with debugging.

from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import COMPLEX_MOVEMENT
from collections import defaultdict 

#imports
import random
import numpy as np

env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, COMPLEX_MOVEMENT)

Q = np.zeros((240 * 256 * 3, env.action_space.n)) # state size is based on 3 dimensional values of the screen

# hyper-parameters
epsilon = 0.1
alpha = 0.5 # Learning rate
gamma = 0.5 # Decay

# number of GAMES
episodes = 500000000000

for episode in range(1, episodes):
    print("Starting episode: " + str(episode))
    state = env.reset()
    finished = False

    # number of steps
    while not finished:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
            ## FIX THIS!
            if action > 12 or action < 0:
                #print("Random: " + str(np.argmax(Q[state, :])))
                print(action)
                print(Q.shape)
                action = env.action_space.sample()

        new_state, reward, done, info = env.step(action)

        Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[new_state, :]) - Q[state, action])

        state = new_state
        env.render()

        if done:
            finished = True

env.close()

It may well be that I am misunderstanding some concepts here, as I am still learning and experimenting with this. Any input or help would be greatly appreciated.

  • What is the shape of `Q[state]`? When no `axis` value is given, [`np.argmax`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html) returns the index of the largest value in the flattened input array. – jdehesa Nov 11 '19 at 13:48
  • print(Q[state].shape) returns: (240, 256, 3, 12). Edit: this is on one of the failed cases (where the action returned was, in this instance, 224437) – Thijmen Boot Nov 11 '19 at 13:52
  • Also yes, the index of the largest Q value / reward should be returned, as this index is 0-11, which is the value for the action. This works perfectly 70% of the time, but then it sometimes returns random high values. – Thijmen Boot Nov 11 '19 at 14:00
  • So you will be getting an index between 0 and 240*256*3*12 = 2211840... `state`, I suppose, is an RGB 240x256 image. If each channel is one byte (256 possible values), the total number of possible states (all possible images of that size) is 256^(240*256*3), which is a number so large even Python needs a moment to write it out (more than four hundred thousand digits). You simply cannot use basic Q-learning for this kind of problem. – jdehesa Nov 11 '19 at 14:05
  • While I understand your point about there being a huge number of states and Q-learning not being the right way to do this, that should not be the reason why I am getting a value that is not between 0 and 11, as all states are created with 12 values. – Thijmen Boot Nov 11 '19 at 14:11
  • The reason for that is the first part... Your state has shape `(240, 256, 3)`, with numbers in [0, 255]. `Q[state]` then has shape `(240, 256, 3, 12)` (not just `(12,)`). When you pass this to [`np.argmax`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html), the array is flattened into shape `(2211840,)`, and the index of the largest value there is selected. Initially it is all zeros, so by convention the index `0` is returned, but as `Q` is updated with some values, eventually the largest value falls well beyond the first twelve positions of the flattened array (the first sketch after these comments shows this). – jdehesa Nov 11 '19 at 14:17
  • That's very interesting, thank you for the insightful comment. What do you recommend to fix this? – Thijmen Boot Nov 11 '19 at 14:55
  • Well, if you want to use Q-learning like this, you should either change the problem or, at least, the state space. Even if you took a single pixel, you would have over 16 million possible states (2^24). The way you are doing things, `state` should be a single integer value (not an array of values), so `Q[state]` would be a 12-element vector (shape `(12,)`) and `np.argmax` would give the expected result. – jdehesa Nov 11 '19 at 15:02
  • In practice, Q-learning can only really be applied to limited discrete problems. You can try it with some of the [toy text environments](https://gym.openai.com/envs/#toy_text), like [frozen lake](https://gym.openai.com/envs/FrozenLake-v0/), where you just have a bunch of possible states (the tile where the agent stands); a minimal sketch for that setup follows these comments. These are way less exciting than SMB, but if you want to go into that kind of problem you will have to look into more advanced models, at least deep Q-learning (or something more like A3C, etc.). [This](https://link.medium.com/nK4POt0Xm0) is a good series on RL (that you may know already). – jdehesa Nov 11 '19 at 15:11
  • Thank you very much I will look into this! – Thijmen Boot Nov 11 '19 at 15:34
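A minimal sketch of the behaviour jdehesa describes, assuming array shapes matching the ones discussed above (the values written into the array are made up purely for illustration): indexing `Q` with a whole RGB frame yields a `(240, 256, 3, 12)` block, and `np.argmax` without an `axis` argument flattens that block before searching, so the returned index can land far beyond 0-11.

import numpy as np

n_actions = 12

# Q laid out as in the question: one row per "state", one column per action
Q = np.zeros((240 * 256 * 3, n_actions))

# A fake all-black frame standing in for the observation returned by the env
state = np.zeros((240, 256, 3), dtype=np.uint8)

q_slice = Q[state]
print(q_slice.shape)  # (240, 256, 3, 12) -- fancy indexing, not a single row

# Put a single non-zero value somewhere past the first 12 entries of the
# flattened slice; argmax with no axis flattens before searching
q_slice[100, 50, 0, 3] = 1.0
print(np.argmax(q_slice))  # a large flattened index, nowhere near 0-11

# With an integer state index, Q[state] is a (12,) row and argmax behaves
state_id = 7
print(np.argmax(Q[state_id]))  # always within 0-11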
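And a minimal tabular Q-learning sketch in the kind of environment recommended in the last comment (FrozenLake-v0, where the observation really is a single integer), keeping the same hyper-parameter names as the question; the concrete values of epsilon, alpha, gamma and the episode count here are placeholders, not tuned settings.

import random
import numpy as np
import gym

env = gym.make('FrozenLake-v0')

# One row per discrete state (tile), one column per action,
# so Q[state] is a (4,) row and np.argmax stays within the action space
Q = np.zeros((env.observation_space.n, env.action_space.n))

epsilon = 0.1   # exploration rate
alpha = 0.5     # learning rate
gamma = 0.95    # discount factor

for episode in range(5000):
    state = env.reset()  # an integer tile index, not an image
    done = False

    while not done:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])  # always a valid action index

        new_state, reward, done, info = env.step(action)

        Q[state, action] += alpha * (reward + gamma * np.max(Q[new_state]) - Q[state, action])
        state = new_state

env.close()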
