
I implemented the Q-learning algorithm and ran it on FrozenLake-v0 from OpenAI Gym. I get a total reward of 185 during training and 7333 during testing, over 10000 episodes. Is this good?

I also tried the Dyna-Q algorithm, but it performs worse than Q-learning: approximately 200 total reward during training and 700-900 during testing over 10000 episodes, with 50 planning steps.

Why is this happening?

Below is the code. Is something wrong with it?

# Setup
env = gym.make('FrozenLake-v0')

epsilon = 0.9
lr_rate = 0.1
gamma = 0.99
planning_steps = 0

total_episodes = 10000
max_steps = 100

# Training and testing loop (one episode):

while t < max_steps:
    action = agent.choose_action(state)  
    state2, reward, done, info = agent.env.step(action)  
    # Removed in testing
    agent.learn(state, state2, reward, action)
    agent.model.add(state, action, state2, reward)
    agent.planning(planning_steps)
    # Till here
    state = state2

def add(self, state, action, state2, reward):
    self.transitions[state, action] = state2
    self.rewards[state, action] = reward

def sample(self, env):
    state, action = 0, 0
    # Random visited state
    if all(np.sum(self.transitions, axis=1)) <= 0:
        state = np.random.randint(env.observation_space.n)
    else:
        state = np.random.choice(np.where(np.sum(self.transitions, axis=1) > 0)[0])

    # Random action in that state
    if all(self.transitions[state]) <= 0:
        action = np.random.randint(env.action_space.n)
    else:    
        action = np.random.choice(np.where(self.transitions[state] > 0)[0])
    return state, action

def step(self, state, action):
    state2 = self.transitions[state, action]
    reward = self.rewards[state, action]
    return state2, reward

def choose_action(self, state):
    if np.random.uniform(0, 1) < epsilon:
        return self.env.action_space.sample()
    else:
        return np.argmax(self.Q[state, :])

def learn(self, state, state2, reward, action):
    # predict = Q[state, action]
    # Q[state, action] = Q[state, action] + lr_rate * (target - predict)
    target = reward + gamma * np.max(self.Q[state2, :])
    self.Q[state, action] = (1 - lr_rate) * self.Q[state, action] + lr_rate * target

def planning(self, n_steps):
    # if len(self.transitions)>planning_steps:
    for i in range(n_steps):
        state, action = self.model.sample(self.env)
        state2, reward = self.model.step(state, action)
        self.learn(state, state2, reward, action)
  • Did you ever solve this? My own intuition is that perhaps the model is overfitting the training environment, leading to a policy that only works well in training. Then your test environment is too different and the policy fails horribly. I don't see any indication that you're setting the random seed; as a first step, perhaps try fixing it to the same value in training and testing. If the Dyna-Q agent doesn't do well in testing here, then there's a bug in the agent itself. – chippies Nov 25 '20 at 15:40
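For concreteness, fixing the seeds with the classic Gym API that the question's code already uses might look like this minimal sketch (the seed value is arbitrary):

import gym
import numpy as np

# Seed every source of randomness with the same value, so training and
# testing runs see comparable randomness (sketch for the classic Gym API).
SEED = 0

env = gym.make('FrozenLake-v0')
env.seed(SEED)               # environment transition noise
env.action_space.seed(SEED)  # action_space.sample() used for exploration
np.random.seed(SEED)         # epsilon-greedy coin flips and model sampling in the posted code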

2 Answers


I guess it could be because the environment is stochastic: learning a deterministic model of a stochastic environment can lead to sub-optimal policies. In Sutton & Barto's RL book, the tabular Dyna-Q algorithm is presented under the assumption of a deterministic environment.
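FrozenLake-v0 is "slippery" by default, so the transitions themselves are random, while the model in the posted code (self.transitions[state, action] = state2) only remembers the most recently observed successor. A quick sketch to see the stochasticity, assuming the same classic 4-tuple Gym step API as the code in the question:

import gym
from collections import Counter

# The same action from the start state leads to different successor states,
# because FrozenLake-v0 is "slippery" (stochastic transitions).
env = gym.make('FrozenLake-v0')
counts = Counter()
for _ in range(1000):
    env.reset()                               # always starts in state 0
    state2, reward, done, info = env.step(1)  # action 1 = DOWN
    counts[state2] += 1
print(counts)  # roughly a 1/3 split over states 0, 1 and 4

So the planning backups replay a deterministic model that disagrees with the real environment much of the time, which can easily make Dyna-Q look worse than plain Q-learning here.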

Eetu

Check that, after a model step is taken, the planning steps sample from the next state, i.e. state2.

If not, planning might be taking repeated steps from the same starting state given by self.env.

However, I may have misunderstood the role of the self.env parameter in self.model.sample(self.env).
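One way to check is to log which pairs planning actually replays. A minimal sketch, reusing the method names from the posted code (model.sample, model.step, learn), purely for inspection:

# Instrumented copy of the question's planning() method, for debugging only:
# it records which (state, action) pairs the model replays on each call.
def planning(self, n_steps):
    sampled = []
    for i in range(n_steps):
        state, action = self.model.sample(self.env)
        state2, reward = self.model.step(state, action)
        self.learn(state, state2, reward, action)
        sampled.append((state, action))
    print(sampled)  # should cover many previously visited (state, action) pairs

If the printed pairs keep repeating the same state, the sampling (rather than the Q-update) is the place to look.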

Obi