I am using a Deep Q-Network (DQN) to play Atari Breakout.
Some of the latest training results:
running reward: 10.19 at episode 19285, frame count 1900000
running reward: 9.95 at episode 19320, frame count 1910000
running reward: 9.12 at episode 19359, frame count 1920000
running reward: 8.89 at episode 19396, frame count 1930000
running reward: 8.26 at episode 19434, frame count 1940000
running reward: 8.71 at episode 19468, frame count 1950000
running reward: 8.04 at episode 19508, frame count 1960000
running reward: 8.17 at episode 19545, frame count 1970000
running reward: 8.10 at episode 19582, frame count 1980000
running reward: 8.66 at episode 19618, frame count 1990000
running reward: 8.42 at episode 19662, frame count 2000000
Solved at episode 19663!
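(For clarity, by "running reward" I mean a moving average of recent episode rewards, roughly along the lines of the sketch below; the 100-episode window and the variable names are illustrative, not copied from my training script.)

import numpy as np

# Sketch: running reward as the mean of the most recent episode rewards.
# The 100-episode window is illustrative only.
def running_reward(episode_reward_history, window=100):
    recent = episode_reward_history[-window:]
    return float(np.mean(recent)) if recent else 0.0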
In my testing, the returns are:
Returns:[0.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Are there any problems with how I am testing the model?
Test code:
from baselines.common.atari_wrappers import make_atari, wrap_deepmind
import numpy as np
import tensorflow as tf
from tensorflow import keras
import gym
seed = 42
model = keras.models.load_model('/content/drive/MyDrive/ai_games_assignment/model', compile=False)
env = make_atari("BreakoutNoFrameskip-v4")
env = wrap_deepmind(env, frame_stack=True, scale=True)
env.seed(seed)
env = gym.wrappers.Monitor(env, '/content/drive/MyDrive/ai_games_assignment/videosss', video_callable=lambda episode_id: True, force=True)
epsilon = 0
num_actions = env.action_space.n  # number of discrete actions, used by the random-action branch
n_episodes = 10
returns = []
for _ in range(n_episodes):
    ret = 0
    state = np.array(env.reset())
    done = False
    while not done:
        if epsilon > np.random.rand(1)[0]:
            action = np.random.choice(num_actions)
        else:
            # Predict action Q-values from the environment state
            state_tensor = tf.convert_to_tensor(state)
            state_tensor = tf.expand_dims(state_tensor, 0)
            state_tensor = np.array(state_tensor)
            action_probs = model.predict(state_tensor)
            # Take best action
            action = tf.argmax(action_probs[0]).numpy()
        # Apply the selected action in the environment
        state_next, reward, done, _ = env.step(action)
        state_next = np.array(state_next)
        ret += reward
    returns.append(ret)
env.close()
print('Returns:{}'.format(returns))
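In case it is relevant: a leaner way to take the greedy action is to call the model directly instead of going through model.predict. The sketch below is a drop-in variant of the else branch in the loop above (it assumes, as there, that the model maps a batch of stacked frames to one Q-value per action); it is meant to behave the same way.

# Sketch: greedy step via a direct model call instead of model.predict
# (assumes the model returns one Q-value per action).
state_tensor = tf.convert_to_tensor(state)
state_tensor = tf.expand_dims(state_tensor, 0)   # add batch dimension
q_values = model(state_tensor, training=False)
action = int(tf.argmax(q_values[0]).numpy())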