I am experimenting with Q-learning in a Super Mario Bros gym environment. I am trying to pick the best action with np.argmax, which should return a valid action index (0 to 11, since the Q-table has env.action_space.n = 12 columns). Instead it sometimes returns values like 224440, and this seems to happen more often the longer the program runs.
I have tried logging the shape of the actions to check whether I am making a mistake somewhere else, and I have printed almost every value to see whether something is being set incorrectly, but I can't find anything.
Currently I'm catching these invalid actions and replacing them with a random action so they don't crash the program. This is obviously not a solution; it is only there for debugging.
from nes_py.wrappers import JoypadSpace
import gym_super_mario_bros
from gym_super_mario_bros.actions import COMPLEX_MOVEMENT
from collections import defaultdict
#imports
import random
import numpy as np
env = gym_super_mario_bros.make('SuperMarioBros-v0')
env = JoypadSpace(env, COMPLEX_MOVEMENT)
Q = np.zeros((240 * 256 * 3, env.action_space.n)) # state size is based on 3 dimensional values of the screen
# hyper-parameters
epsilon = 0.1
alpha = 0.5 # Learning rate
gamma = 0.5 # Decay
# number of GAMES
episodes = 500000000000
for episode in range(1, episodes):
    print("Starting episode: " + str(episode))
    state = env.reset()
    finished = False
    # number of steps
    while not finished:
        if random.uniform(0, 1) < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q[state])
        ## FIX THIS!
        # valid action indices are 0 .. env.action_space.n - 1
        if action >= env.action_space.n or action < 0:
            #print("Random: " + str(np.argmax(Q[state, :])))
            print(action)
            print(Q.shape)
            action = env.action_space.sample()
        new_state, reward, done, info = env.step(action)
        Q[state, action] = Q[state, action] + alpha * (reward + gamma * np.max(Q[new_state, :]) - Q[state, action])
        state = new_state
        env.render()
        if done:
            finished = True
env.close()
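The out-of-range values can probably be reproduced without the emulator. Below is a minimal sketch, assuming the state returned by env.reset() is the raw RGB frame (a uint8 array of shape (240, 256, 3)) rather than a single integer. The random frame and the names n_actions, rows, flat_action and single_row_action are stand-ins for illustration and are not part of my program:

```python
import numpy as np

n_actions = 12  # assumed to match env.action_space.n for COMPLEX_MOVEMENT
Q = np.zeros((240 * 256 * 3, n_actions))

# Stand-in for the frame from env.reset(): an image array, not an integer index.
rng = np.random.default_rng(0)
state = rng.integers(0, 256, size=(240, 256, 3), dtype=np.uint8)

# Fill the rows that pixel values 0..255 can select, so argmax is not trivially 0.
Q[:256] = rng.random((256, n_actions))

# Indexing with an integer array picks one Q row per pixel value.
rows = Q[state]
print(rows.shape)  # (240, 256, 3, 12)

# Without an axis argument, argmax indexes into the flattened array.
flat_action = np.argmax(rows)
print(flat_action, "out of", rows.size, "flattened entries")

# Indexing with a single integer row keeps argmax inside the action range.
single_row_action = np.argmax(Q[5])
print(single_row_action)
```

If that assumption holds, np.argmax(Q[state]) can return anything up to 240 * 256 * 3 * 12 - 1, which would match the large values I am seeing.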
It is quite possible that I am misunderstanding some concepts here, as I am still learning and experimenting with this. Any input or help would be greatly appreciated.