I'm trying to build my own environment for study purposes in Q-learning, training it with a simple neural network with a linear output activation. The problem is that it doesn't seem to learn to play this simple game, in which the player must reach the goal without touching the enemy. The mean of the per-episode reward sums stays in the same range even after 2000 episodes. I would be very grateful if someone could identify the problem in my code snippets.

# model
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(4,)))  # input: 1x4 state vector
model.add(Dense(4, activation='relu'))
model.add(Dense(4, activation='linear'))                  # output: one Q-value per action
model.compile(loss='mse', optimizer='adam')
###
# parameters
MOVE_REWARD = -1     # per-step penalty
ENEMY_REWARD = -100  # terminal penalty for touching the enemy
GOAL_REWARD = 100    # terminal reward for reaching the goal
epsilon = 0.5        # exploration rate
EPS_DECAY = 0.9999   # multiplicative epsilon decay factor
DISCOUNT = 0.9       # gamma
LEARNING = 0.8       # Q-learning update rate (alpha)
###
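
For reference, the update inside the loop below is intended to implement the standard Q-learning rule, with the network standing in for a Q-table:

Q(s,a) <- (1 - LEARNING) * Q(s,a) + LEARNING * (reward + DISCOUNT * max_a' Q(s',a'))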
# the inner learning loop, run inside an outer loop over episodes
# (after creating new random positions for player, goal and enemy):
for i in range(200):  # at most 200 steps per episode
    [some code]
    # create delta coordinates of the current game state (obs):
    # a 1x4 vector of signed offsets,
    # player-to-goal in x,y and player-to-enemy in x,y
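    # hypothetical sketch of this step (assuming player/goal/enemy expose .x/.y):
    # obs = np.array([[goal.x - player.x, goal.y - player.y,
    #                  enemy.x - player.x, enemy.y - player.y]])
    # note: model.predict expects shape (1, 4), hence the outer brackets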
    if np.random.random() > epsilon:
        action = np.argmax(model.predict(obs))
    else:
        action = np.random.randint(0, 4)
    player.move(action) # player moves one step in the chosen direction
    _end_ = False
    if player == enemy:
        reward = ENEMY_REWARD
        _end_ = True
    elif player == goal:
        reward = GOAL_REWARD
        _end_ = True
    else:
        reward = MOVE_REWARD
    [...]
    # create delta coordinates of the new game state (new_obs):
    # a 1x4 vector of signed offsets,
    # player-to-goal in x,y and player-to-enemy in x,y
    max_future_q = np.max(model.predict(new_obs))
    q_values = model.predict(obs)[0]   # current Q-value estimates for obs
    current_q = q_values[action]
    if reward == GOAL_REWARD:
        new_q = GOAL_REWARD            # terminal: no future reward to bootstrap
    else:
        new_q = (1 - LEARNING) * current_q + LEARNING * (reward + DISCOUNT * max_future_q)
    target_vec = q_values.copy()       # only the taken action's Q-value changes
    target_vec[action] = new_q
    target_vec = target_vec.reshape(1, 4)
    model.fit(obs, target_vec, verbose=0, epochs=1)
    if _end_:
        break
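
Not shown in the snippet: between episodes epsilon is decayed, roughly

epsilon *= EPS_DECAY

so that exploration decreases over training.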

I also tried rewriting it with a convolutional network and had the same issue. It would be great if someone could spot an error in my algorithm!

ChrisP
  • Hi ChrisP, I am not an expert in reinforcement learning, but if scoring a goal is harder than avoiding enemies (which it inherently should be, since avoiding enemies is a prerequisite to scoring), then with the reward for scoring and the penalty for touching an enemy set to the same magnitude, the agent might stop trying to score at all and only try to avoid the enemy. – Ömer Faruk Kırlı Nov 15 '19 at 13:12
  • A possible solution along these lines would be to reduce the penalty for touching the enemy, increase the goal reward, or increase the penalty for moving, which would make merely avoiding enemies without ever scoring an infeasible strategy, even against moving enemies. Try it and see how it goes. Best of luck :) 0m3rF – Ömer Faruk Kırlı Nov 15 '19 at 13:15
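
To make the suggestion above concrete, a rebalanced reward setup could look like this (the values are illustrative guesses to experiment with, not tuned constants):

MOVE_REWARD = -2     # stronger step penalty, so stalling becomes costly
ENEMY_REWARD = -50   # milder penalty for touching the enemy
GOAL_REWARD = 200    # larger reward for actually reaching the goal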
