I'm trying to build my own environment for study purposes in Q-learning and to train it with a simple neural network with a linear output activation. The problem is that it doesn't seem to learn to play this simple game, in which the player has to reach the goal without touching the enemy. The mean reward sum stays in the same range even after 2000 episodes. I would be very grateful if someone could identify the problem in my code snippets.
# model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import numpy as np

model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(4,)))  # input: 4 delta coordinates
model.add(Dense(4, activation='relu'))
model.add(Dense(4, activation='linear'))                  # output: one Q-value per action
model.compile(loss='mse', optimizer='adam')
###
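Just to confirm the model's input/output shapes, here is a quick smoke test with made-up numbers (not part of the training code):

dummy_obs = np.array([[2.0, -1.0, -3.0, 4.0]])  # made-up delta values, shape (1, 4)
print(model.predict(dummy_obs).shape)           # -> (1, 4): one Q-value per action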
# parameters
MOVE_REWARD = -1      # reward for an ordinary step
ENEMY_REWARD = -100   # reward for touching the enemy
GOAL_REWARD = 100     # reward for reaching the goal
epsilon = 0.5         # exploration rate
EPS_DECAY = 0.9999    # multiplicative epsilon decay
DISCOUNT = 0.9        # gamma
LEARNING = 0.8        # learning rate alpha in the Q update
###
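For context on the exploration schedule: multiplying epsilon by 0.9999 per update decays it very slowly (a quick side calculation, not part of the training code):

# epsilon after n decay steps: 0.5 * 0.9999**n
for n in (1000, 10000, 100000):
    print(n, 0.5 * 0.9999 ** n)  # ~0.452, ~0.184, ~0.00002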
# the learning loop; this sits inside an outer loop over the number of episodes
# (after creating new positions for player, goal and enemy)
for i in range(200):
    [some code]
    # build the delta coordinates of the current game state (obs):
    # a 1x4 vector of four signed values, player-to-goal dx,dy
    # and player-to-enemy dx,dy
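    # a sketch of what I mean (the .x/.y attribute names are illustrative):
    # obs = np.array([[goal.x - player.x, goal.y - player.y,
    #                  enemy.x - player.x, enemy.y - player.y]], dtype=np.float32)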
    # epsilon-greedy action selection
    if np.random.random() > epsilon:
        action = np.argmax(model.predict(obs))
    else:
        action = np.random.randint(0, 4)
    player.move(action)  # player moves one step in the chosen direction
    if player == enemy:
        reward = ENEMY_REWARD
        _end_ = True
    elif player == goal:
        reward = GOAL_REWARD
        _end_ = True
    else:
        reward = MOVE_REWARD
        _end_ = False
    [...]
    # build the delta coordinates of the new game state (new_obs),
    # same 1x4 layout as obs above
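    # the update follows the standard tabular Q-learning rule:
    # Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a'))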
    max_future_q = np.max(model.predict(new_obs))
    q_values = model.predict(obs)[0]   # current Q-values for all 4 actions
    current_q = q_values[action]
    if reward == GOAL_REWARD:
        new_q = GOAL_REWARD            # terminal state: target is the raw reward
    else:
        new_q = (1 - LEARNING) * current_q + LEARNING * (reward + DISCOUNT * max_future_q)
    target_vec = q_values.copy()       # only the taken action's target changes
    target_vec[action] = new_q
    target_vec = target_vec.reshape(1, 4)
    model.fit(obs, target_vec, verbose=0, epochs=1)
    if _end_:
        break
###
I also tried rewriting it with a convolutional network and had the same issue. It would be great if someone could find the error in my algorithm!
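In case it helps, here is a condensed, self-contained sketch of the whole setup above. The 5x5 grid, the random spawning, the movement rule, and the per-step epsilon decay are simplified stand-ins, not my exact environment code:

# condensed sketch (grid size, spawning, movement and decay placement are stand-ins)
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

SIZE = 5
MOVE_REWARD, ENEMY_REWARD, GOAL_REWARD = -1, -100, 100
epsilon, EPS_DECAY, DISCOUNT, LEARNING = 0.5, 0.9999, 0.9, 0.8
MOVES = [np.array(m) for m in ((1, 0), (-1, 0), (0, 1), (0, -1))]

def spawn():
    # three distinct random cells for player, goal and enemy
    cells = np.random.choice(SIZE * SIZE, 3, replace=False)
    return [np.array(divmod(int(c), SIZE)) for c in cells]

def deltas(player, goal, enemy):
    # 1x4 observation: player-to-goal dx,dy and player-to-enemy dx,dy
    return np.concatenate([goal - player, enemy - player]).reshape(1, 4).astype(np.float32)

model = Sequential([
    Dense(4, activation='relu', input_shape=(4,)),
    Dense(4, activation='relu'),
    Dense(4, activation='linear'),
])
model.compile(loss='mse', optimizer='adam')

for episode in range(2000):
    player, goal, enemy = spawn()
    for step in range(200):
        obs = deltas(player, goal, enemy)
        if np.random.random() > epsilon:
            action = np.argmax(model.predict(obs, verbose=0))
        else:
            action = np.random.randint(0, 4)
        player = np.clip(player + MOVES[action], 0, SIZE - 1)
        if (player == enemy).all():
            reward, done = ENEMY_REWARD, True
        elif (player == goal).all():
            reward, done = GOAL_REWARD, True
        else:
            reward, done = MOVE_REWARD, False
        new_obs = deltas(player, goal, enemy)
        max_future_q = np.max(model.predict(new_obs, verbose=0))
        target_vec = model.predict(obs, verbose=0)
        if reward == GOAL_REWARD:
            new_q = GOAL_REWARD
        else:
            new_q = ((1 - LEARNING) * target_vec[0][action]
                     + LEARNING * (reward + DISCOUNT * max_future_q))
        target_vec[0][action] = new_q
        model.fit(obs, target_vec, verbose=0, epochs=1)
        epsilon *= EPS_DECAY
        if done:
            break
###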