Problems with implementing approximate (feature-based) Q-learning

Question

I am new to reinforcement learning. I recently learned about approximate Q-learning, or feature-based Q-learning, in which you describe states by features to save space. I have tried to implement this in a simple grid game. Here, the agent is supposed to learn not to go into a firepit (marked by an f) and instead to eat as many dots as possible. Here is the grid used:

...A
.f.f
.f.f
...f

Here A marks the agent's starting location. When implementing, I set up two features. One was 1/((distance to closest dot)^2), and the other was (distance to firepit) + 1. When the agent enters a firepit, the program returns a reward of -100. If it moves to a non-firepit position that was already visited (and thus there is no dot to be eaten), the reward is -50. If it moves to an unvisited dot, the reward is +500. In the above grid, no matter what the initial weights are, the program never learns the correct weight values. Specifically, in the output, the first training session achieves a score (how many dots it ate) of 3, but in all other training sessions the score is just 1, and the weights converge to incorrect values of -125 for weight 1 (distance to firepit) and 25 for weight 2 (distance to unvisited dot). Is there something specifically wrong with my code, or is my understanding of approximate Q-learning incorrect?

I have tried to play around with the rewards that the environment is giving and also with the initial weights. None of these have fixed the problem. Here is the link to the entire program: https://repl.it/repls/WrongCheeryInterface

Here is what is going on in the main loop:

while(points != NUMPOINTS){
  bool playerDied = false;
  if(!start){
    if(!atFirepit()){
      r = 0;
      if(visited[player.x][player.y] == 0){
        points += 1;
        r += 500;
      }else{
        r += -50;
      }
    }else{
      playerDied = true;
      r = -100;
    }
  }

  //Update visited
  visited[player.x][player.y] = 1;

  if(!start){
    //This is based off the q learning update formula
    pairPoint qAndA = getMaxQAndAction();
    double maxQValue = qAndA.q;
    double sample = r;
    if(!playerDied && points != NUMPOINTS)
      sample = r + (gamma2 * maxQValue);
    double diff = sample - qVal;
    updateWeights(player, diff);
  }

  // checking end game condition
  if(playerDied || points == NUMPOINTS) break;

  pairPoint qAndA = getMaxQAndAction();
  qVal = qAndA.q;
  int bestAction = qAndA.a;

  //update player and q value
  player.x += dx[bestAction];
  player.y += dy[bestAction];

  start = false;
}

I would expect both weights to be positive, but one of them (the one for the distance to the firepit) is negative.

I also expected the program to learn over time that it is bad to enter a firepit and also bad, but not as bad, to go to an already visited position.

score 1 · Accepted Answer · answered Apr 06 '19 at 10:18

Probably not the answer you want to hear, but:

Have you tried implementing the simpler tabular Q-learning before approximate Q-learning? In your setting, with only a few states and actions, it will work perfectly. If you are learning, I strongly recommend starting with the simpler cases to get a better understanding of, and intuition for, how reinforcement learning works.
Do you know the implications of using approximators instead of learning the exact Q function? In some cases, due to the complexity of the problem (e.g., when the state space is continuous), you have to approximate the Q function (or the policy, depending on the algorithm), but this may introduce convergence problems. Additionally, in your case, you are trying to hand-pick features, which usually requires in-depth knowledge of the problem (i.e., the environment) and the learning algorithm.
Do you understand the meaning of the hyperparameters alpha and gamma? You cannot choose them arbitrarily. Sometimes, though not always, they are critical to obtaining the expected results, depending heavily on the problem and the learning algorithm. In your case, looking at the convergence curve of your weights, it's pretty clear that you are using a value of alpha that is too high. As you pointed out, after the first training session your weights remain constant.

Therefore, practical recommendations:

Be sure to solve your grid game using a tabular Q-learning algorithm before trying more complex things.
Experiment with different values of alpha, gamma and rewards.
Read more in depth about approximate RL. A very good and accessible book (starting from zero knowledge) is the classic book by Sutton and Barto, Reinforcement Learning: An Introduction, which is freely available and was updated in 2018.

To answer your first question, I have implemented a Q-learning algorithm for this grid world. I had in fact tried to use approximate Q-learning for a Pacman game before deciding to switch to the grid world for simplicity's sake. As for the third part of your answer, yeah, I don't think I have a very concrete understanding of those constants. I will try to play around with them, as well as the rewards, as I suspect my environment might be the source of some of my problems. Thanks for the response. — Love2Code, Apr 06 '19 at 21:22
Ok, well done, I think you are heading in the right direction! Don't be frustrated; solving complex problems (such as the Pacman game) with RL is usually far from trivial. Another question: what kind of approximator did you try with Pacman? Also hand-picked features? In my opinion your environment is a good step before Pacman, but it may be useful to try even simpler environments first. I guess you already know OpenAI Gym (https://gym.openai.com/), which contains some well-known environments (although they are written in Python). — Pablo EM, Apr 06 '19 at 23:51
In Pacman, I only had two weights (which I now believe is one of the main reasons it didn't work). One weight was for the distance to the closest ghost, and the other was for 1/(distance to closest dot squared). And I'll definitely try using OpenAI Gym, as I feel the environments will be set up better. — Love2Code, Apr 07 '19 at 05:20
Yes, that is probably a key point to take into account. Notice that, intuitively, your features (distance to closest dot and distance to firepit) don't provide enough information to the agent. Which direction should the agent take knowing only the distances to the dots and the firepit? It knows nothing about the positions of the firepit and the dot. — Pablo EM, Apr 07 '19 at 18:55
A little off-topic, but as you are new to Stack Overflow, maybe this link is relevant to you: https://stackoverflow.com/help/accepted-answer — Pablo EM, Apr 07 '19 at 19:09
