I am using a Deep Q-Network (DQN) to play Atari Breakout.
Some of the latest training results:
running reward: 10.19 at episode 19285, frame count 1900000
running reward: 9.95 at episode 19320, frame count 1910000
running reward: 9.12 at episode 19359, frame count 1920000
running reward: 8.89 at episode 19396, frame count 1930000
running reward: 8.26 at episode 19434, frame count 1940000
running reward: 8.71 at episode 19468, frame count 1950000
running reward: 8.04 at episode 19508, frame count 1960000
running reward: 8.17 at episode 19545, frame count 1970000
running reward: 8.10 at episode 19582, frame count 1980000
running reward: 8.66 at episode 19618, frame count 1990000
running reward: 8.42 at episode 19662, frame count 2000000
Solved at episode 19663!
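(For clarity, by "running reward" I mean a moving average of recent episode rewards, roughly along the lines of the sketch below; the 100-episode window and the variable names are illustrative, not copied from my training script.)

import numpy as np

# Sketch: running reward as the mean of the most recent episode rewards.
# The 100-episode window is illustrative only.
def running_reward(episode_reward_history, window=100):
    recent = episode_reward_history[-window:]
    return float(np.mean(recent)) if recent else 0.0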
In my testing, the returns are:
Returns:[0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Are there any problems with how I am testing the model?
Test code:
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym
seed = 42
model = keras.models.load_model('/content/drive/MyDrive/ai_games_assignment/model', compile=False)
env = make_atari("BreakoutNoFrameskip-v4")
env = wrap_deepmind(env, frame_stack=True, scale=True)
env.seed(seed)
env = gym.wrappers.Monitor(env, '/content/drive/MyDrive/ai_games_assignment/videosss', video_callable=lambda episode_id: True, force=True)
epsilon = 0
num_actions = env.action_space.n  # number of discrete actions, used by the random-action branch
n_episodes = 10
returns = []
for _ in range(n_episodes):
    ret = 0
    state = np.array(env.reset())
    done = False
    while not done:
        if epsilon > np.random.rand(1)[0]:
            action = np.random.choice(num_actions)
        else:
            # Predict action Q-values from the environment state
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            state_tensor = np.array(state_tensor)
            action_probs = model.predict(state_tensor)
            # Take best action
            action = tf.argmax(action_probs[0]).numpy()
        # Apply the selected action in the environment
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)
        ret += reward
    returns.append(ret)
env.close()
print('Returns:{}'.format(returns))
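In case it is relevant: a leaner way to take the greedy action is to call the model directly instead of going through model.predict. The sketch below is a drop-in variant of the else branch in the loop above (it assumes, as there, that the model maps a batch of stacked frames to one Q-value per action); it is meant to behave the same way.

# Sketch: greedy step via a direct model call instead of model.predict
# (assumes the model returns one Q-value per action).
state_tensor = tf.convert_to_tensor(state)
state_tensor = tf.expand_dims(state_tensor, 0)   # add batch dimension
q_values = model(state_tensor, training=False)
action = int(tf.argmax(q_values[0]).numpy())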