
I implemented the Q-learning algorithm and ran it on FrozenLake-v0 from OpenAI Gym. I get a total reward of 185 during training and 7333 during testing, over 10000 episodes. Is this good?

I also tried the Dyna-Q algorithm, but it performs worse than Q-learning: approximately 200 total reward during training and 700-900 during testing over 10000 episodes, with 50 planning steps.

Why is this happening?

Below is the code. Is something wrong with it?

# Setup
env = gym.make('FrozenLake-v0')

epsilon = 0.9
lr_rate = 0.1
gamma = 0.99
planning_steps = 0

total_episodes = 10000
max_steps = 100

# Training and testing loop (one episode):

while t < max_steps:
    action = agent.choose_action(state)  
    state2, reward, done, info = agent.env.step(action)  
    # Removed in testing
    agent.learn(state, state2, reward, action)
    agent.model.add(state, action, state2, reward)
    agent.planning(planning_steps)
    # Till here
    state = state2

def add(self, state, action, state2, reward):
    self.transitions[state, action] = state2
    self.rewards[state, action] = reward

def sample(self, env):
    state, action = 0, 0
    # Random visited state
    if all(np.sum(self.transitions, axis=1)) <= 0:
        state = np.random.randint(env.observation_space.n)
    else:
        state = np.random.choice(np.where(np.sum(self.transitions, axis=1) > 0)[0])

    # Random action in that state
    if all(self.transitions[state]) <= 0:
        action = np.random.randint(env.action_space.n)
    else:    
        action = np.random.choice(np.where(self.transitions[state] > 0)[0])
    return state, action

def step(self, state, action):
    state2 = self.transitions[state, action]
    reward = self.rewards[state, action]
    return state2, reward

def choose_action(self, state):
    if np.random.uniform(0, 1) < epsilon:
        return self.env.action_space.sample()
    else:
        return np.argmax(self.Q[state, :])

def learn(self, state, state2, reward, action):
    # predict = Q[state, action]
    # Q[state, action] = Q[state, action] + lr_rate * (target - predict)
    target = reward + gamma * np.max(self.Q[state2, :])
    self.Q[state, action] = (1 - lr_rate) * self.Q[state, action] + lr_rate * target

def planning(self, n_steps):
    # if len(self.transitions)>planning_steps:
    for i in range(n_steps):
        state, action = self.model.sample(self.env)
        state2, reward = self.model.step(state, action)
        self.learn(state, state2, reward, action)
  • Did you ever solve this? My own intuition is that perhaps the model is overfitting the training environment, leading to a policy that only works well in training. Then your test environment is too different and the policy fails horribly. I don't see any indication that you're setting the random seed; as a first step, perhaps try fixing it to the same value in training and testing. If the Dyna-Q agent doesn't do well in testing here, then there's a bug in the agent itself. – chippies Nov 25 '20 at 15:40
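For concreteness, fixing the seeds with the classic Gym API that the question's code already uses might look like this minimal sketch (the seed value is arbitrary):

import gym
import numpy as np

# Seed every source of randomness with the same value, so training and
# testing runs see comparable randomness (sketch for the classic Gym API).
SEED = 0

env = gym.make('FrozenLake-v0')
env.seed(SEED)               # environment transition noise
env.action_space.seed(SEED)  # action_space.sample() used for exploration
np.random.seed(SEED)         # epsilon-greedy coin flips and model sampling in the posted code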

2 Answers


I guess it could be because the environment is stochastic: learning a deterministic model of a stochastic environment can lead to sub-optimal policies. In Sutton & Barto's RL book, the tabular Dyna-Q algorithm is presented under the assumption of a deterministic environment.
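FrozenLake-v0 is "slippery" by default, so the transitions themselves are random, while the model in the posted code (self.transitions[state, action] = state2) only remembers the most recently observed successor. A quick sketch to see the stochasticity, assuming the same classic 4-tuple Gym step API as the code in the question:

import gym
from collections import Counter

# The same action from the start state leads to different successor states,
# because FrozenLake-v0 is "slippery" (stochastic transitions).
env = gym.make('FrozenLake-v0')
counts = Counter()
for _ in range(1000):
    env.reset()                               # always starts in state 0
    state2, reward, done, info = env.step(1)  # action 1 = DOWN
    counts[state2] += 1
print(counts)  # roughly a 1/3 split over states 0, 1 and 4

So the planning backups replay a deterministic model that disagrees with the real environment much of the time, which can easily make Dyna-Q look worse than plain Q-learning here.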

Eetu

Check that, after a model step is taken, the planning steps sample from the next state, i.e. state2.

If not, planning might be taking repeated steps from the same starting state given by self.env.

However, I may have misunderstood the role of the self.env parameter in self.model.sample(self.env).
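One way to check is to log which pairs planning actually replays. A minimal sketch, reusing the method names from the posted code (model.sample, model.step, learn), purely for inspection:

# Instrumented copy of the question's planning() method, for debugging only:
# it records which (state, action) pairs the model replays on each call.
def planning(self, n_steps):
    sampled = []
    for i in range(n_steps):
        state, action = self.model.sample(self.env)
        state2, reward = self.model.step(state, action)
        self.learn(state, state2, reward, action)
        sampled.append((state, action))
    print(sampled)  # should cover many previously visited (state, action) pairs

If the printed pairs keep repeating the same state, the sampling (rather than the Q-update) is the place to look.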

Obi