
I am trying to implement linear function approximation for solving MountainCar using Q-learning. I know this environment can't be perfectly approximated with a linear function due to the spiral-like shape of the optimal policy, but the behaviour I am getting is quite strange.

[Figure: Returns evolution plot]

I don't understand why the reward goes up until it reaches what seems like convergence and then starts going down.

Please find my code below. I would be very glad if somebody could give me an idea of what I am doing wrong.

Initializations

import gym
import random
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output  # used to refresh the plot during training
 
class Agent:
    def __init__(self, gamma: float, epsilon: float, alpha: float, n_actions: int, n_steps: int = 1):
        self.n_steps = n_steps
        self.gamma = gamma
        self.epsilon = epsilon
        self.alpha = alpha
        self.n_actions = n_actions
        self.state_action_values = {}
        self.state_values = {}
        self.w = None

    def get_next_action(self, state):
        raise NotImplementedError

    def update(self, state, action: int, reward, state_prime):
        raise NotImplementedError

    def reset(self):
        # Optional override for agents that keep per-episode state
        pass

Q-Learning Agent

class FunctionApproximationQLearning(Agent):
    def __init__(self, gamma, epsilon, alpha, n_actions, n_features):
        super().__init__(gamma, epsilon, alpha, n_actions)
        self.w = np.zeros((n_features, n_actions))

    def get_next_action(self, x):
        # Epsilon-greedy action selection over the linear Q estimates
        if random.random() > self.epsilon:
            return np.argmax(self._lr_predict(x))
        else:
            return np.random.choice(range(self.n_actions))

    def update(self, state, action, reward, state_prime, done):
        # TD target: r + gamma * max_a' Q(s', a'), or just r at terminal states
        if not done:
            td_target = reward + self.gamma * np.max(self._lr_predict(state_prime))
        else:
            td_target = reward
        # Target vector: keep current predictions, replace only the taken action's value
        target = self._lr_predict(state)
        target[action] = td_target
        # One gradient step of the linear approximator towards the target
        self._lr_fit(state, target)

    def _lr_predict(self, x):
        # x has shape (n_features,); returns the Q-values for all actions
        #x = np.concatenate([x, [1]])
        return x @ self.w

    def _lr_fit(self, x, target):
        pred = self._lr_predict(x)
        #x = np.concatenate([x, [1]])

        if len(x.shape) == 1:
            x = np.expand_dims(x, 0)
        if len(target.shape) == 1:
            target = np.expand_dims(target, 1)
        # Gradient step: w += alpha * (target - prediction) * x, one column per action
        self.w += self.alpha * ((np.array(target) - np.expand_dims(pred, 1)) @ x).transpose()

Execution

env = gym.make("MountainCar-v0").env
state = env.reset()
agent = FunctionApproximationQLearning(gamma=0.99, alpha=0.001, epsilon=0.1,
                                       n_actions=env.action_space.n,
                                       n_features=env.observation_space.shape[0])

rewards = []
pos = []
for episode in range(1000000):
    done = False
    cumreward = 0
    poss = []
    state = env.reset()
    c = 0

    while not done and c < 500:
        action = agent.get_next_action(state)
        next_state, reward, done, _ = env.step(action)
        agent.update(state, action, reward, next_state, done)
        state = next_state
        cumreward += reward
        c += 1
        poss.append(state[0])  # track the positions reached during this episode

    rewards.append(cumreward)
    # MountainCar-v0 returns are negative; -110 over 100 episodes is the usual "solved" threshold
    if np.mean(rewards[-100:]) > -110:
        break
    pos.append(np.max(poss))
    if episode % 100 == 0:
        clear_output(True)
        plt.plot(pd.Series(rewards).ewm(span=1000).mean())
        plt.title("Returns evolution")
        plt.xlabel("Episodes")
        plt.ylabel("Return")
        plt.show()
ivallesp

1 Answer


Let me know if I'm wrong, but it seems you are trying to use a linear function approximator with the state variables themselves, i.e., car position and velocity, used directly as features. In that case, not only is it impossible to perfectly approximate the value function, it is impossible to approximate anything close to the optimal value function. Therefore, although your figure seems to suggest some convergence, I'm pretty sure that is not the case.

A very nice property of two-dimensional toy environments such as MountainCar is that you can plot the approximated Q-value function. In Sutton & Barto's book (chapter 8, Figure 8.10) you can find the "cost-to-go" function (easily obtained from the Q-values) at several points during the learning process. As you can see there, the function is highly non-linear in car position and velocity. My advice is to plot the same cost-to-go function and verify that your plots are similar to the ones shown in the book.
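For example, a minimal sketch of such a plot (this is not in your code; it assumes your trained agent instance is in scope and evaluates the cost-to-go, -max_a Q(s, a), on a grid over the state space) could look like this:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401, needed for 3D plots on older matplotlib

# Grid over the MountainCar state space
positions = np.linspace(-1.2, 0.6, 50)     # valid car positions
velocities = np.linspace(-0.07, 0.07, 50)  # valid car velocities
P, V = np.meshgrid(positions, velocities)

# Cost-to-go is the negated greedy Q-value at each state
cost_to_go = np.array([[-np.max(agent._lr_predict(np.array([p, v])))
                        for p in positions] for v in velocities])

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(P, V, cost_to_go)
ax.set_xlabel("Position")
ax.set_ylabel("Velocity")
ax.set_zlabel("Cost-to-go")
plt.show()

With your current linear approximator this surface will always be a plane, which is the core of the problem.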

Using linear function approximators with Q-learning usually requires (except in very specific cases) computing a set of features, so that your approximator is linear with respect to the extracted features, not the original ones. In this way you can approximate non-linear functions (with respect to the original state variables, of course). An extended explanation of this concept can be found, again, in Sutton & Barto's book, section 8.3.
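As a rough illustration (not part of your code, just one common pattern using scikit-learn's RBFSampler; tile coding as described in the book is another option), the feature extraction could look like this:

import numpy as np
import gym
import sklearn.pipeline
import sklearn.preprocessing
from sklearn.kernel_approximation import RBFSampler

env = gym.make("MountainCar-v0").env

# Fit a scaler and several RBF kernels on states sampled from the observation space
observation_examples = np.array([env.observation_space.sample() for _ in range(10000)])
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(observation_examples)

featurizer = sklearn.pipeline.FeatureUnion([
    ("rbf1", RBFSampler(gamma=5.0, n_components=100)),
    ("rbf2", RBFSampler(gamma=2.0, n_components=100)),
    ("rbf3", RBFSampler(gamma=1.0, n_components=100)),
    ("rbf4", RBFSampler(gamma=0.5, n_components=100)),
])
featurizer.fit(scaler.transform(observation_examples))

def featurize(state):
    # Map the raw (position, velocity) state to a 400-dimensional feature vector
    return featurizer.transform(scaler.transform([state]))[0]

# The agent is then trained on features instead of raw states, e.g.
#   agent = FunctionApproximationQLearning(..., n_features=400)
#   action = agent.get_next_action(featurize(state))

The approximator stays linear in the 400 extracted features, but becomes non-linear in position and velocity, which is what MountainCar needs.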

Pablo EM
  • Thank you so much for the explanation. I am indeed using the state representation (position and velocity) as features, knowing that it is not going to arrive at a very good solution. The only thing I don't understand is why the return decreases with time! I mean, that should be impossible when using stochastic gradient descent. – ivallesp Aug 31 '18 at 10:48
  • Well, your function approximator can only represent a plane with respect to car position and velocity. So I guess the weights are updated to represent the value function, which seems to work for some time (returns go toward 0), but at some point it diverges and the returns get more and more negative. Why do you think the return cannot decrease with time? – Pablo EM Aug 31 '18 at 14:50
  • I think it cannot decrease with time because I am following the gradient, so the weights of the linear regression are adjusted towards the TD target. Given that, how is it possible to perform updates in a direction in which the value function becomes more and more inexact? I can accept that, because it is gradient descent, the curve is not always monotonically increasing, but there is a point at which it becomes monotonically decreasing, and that is what I can't wrap my head around. I also checked that the weights don't grow towards infinity... – ivallesp Aug 31 '18 at 17:58
  • Oh, I forgot to mention: I have been able to solve it with a simple trick I read about: repeating the same action n times (in this case n=4). I arrive at an average cumulative reward of ~40. Still, I want to understand why it diverges. Thanks so much. – ivallesp Aug 31 '18 at 17:59
  • Even if you are using SGD, the return is not the same as your value-function error. So even if your value function converges to something, it doesn't mean the return has to improve. In addition, Q-learning uses bootstrapping (it is not true SGD), so it can be more unstable and the Q-values can diverge, although that does not seem to be your case since you have checked your weights. – Pablo EM Sep 01 '18 at 00:35
  • Still not convinced @ivallesp? Don't hesitate to discuss further if it's not clear. – Pablo EM Sep 02 '18 at 22:25
  • Not totally, haha. My main concern is that I ran the algorithm several times and sometimes I see the curve I attached to the post, sometimes it converges to the optimal policy, and sometimes it diverges from the beginning. It is quite strange and I am worried because I don't know whether it is normal or an implementation error, haha. – ivallesp Sep 03 '18 at 11:23
  • Well, if you are still using car position and velocity as features, probably anything can happen. Of course, I cannot guarantee it is not due to a code bug. IMO, if you want to understand what is happening, the best way is to plot the value (or cost-to-go) function at different time steps. Since you know how it should look, you will see whether the problem is related to your approximator (my bet ;-) ) or whether you should look elsewhere. – Pablo EM Sep 05 '18 at 09:48