I am new to reinforcement learning. I recently learned about approximate Q-learning (also called feature-based Q-learning), in which states are described by features to save space. I have tried to implement it in a simple grid game, where the agent is supposed to learn to avoid firepits (marked with an f) and to eat as many dots as possible. Here is the grid used:
...A
.f.f
.f.f
...f
Here A marks the agent's starting location. I set up two features: 1/((distance to closest dot)^2) and (distance to firepit) + 1. The rewards are -100 for entering a firepit, -50 for moving to a non-firepit position that was already visited (so there is no dot left to eat), and +500 for moving to an unvisited dot.
On the above grid, no matter what the initial weights are, the program never learns the correct weight values. In the output, the first training session reaches a score (the number of dots eaten) of 3, but every later session scores just 1, and the weights converge to incorrect values: -125 for weight 1 (distance to firepit) and 25 for weight 2 (distance to unvisited dot). Is there something wrong with my code, or is my understanding of approximate Q-learning incorrect?
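To make the feature definitions concrete, here is a minimal sketch of the two features and the linear Q-value they feed into. It is not the code from the linked program: the names `Cell`, `nearestDistance`, `features` and `qValue` are hypothetical, distances are assumed to be Manhattan distances, every '.' in the grid is treated as an uneaten dot, and a dot distance of 0 is clamped to 1 to avoid dividing by zero (also an assumption).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdlib>
#include <limits>
#include <string>
#include <vector>

struct Cell { int row; int col; };  // hypothetical position type

// Manhattan distance from `from` to the nearest cell holding `target`.
// Returns -1 if the grid contains no such cell.
int nearestDistance(const std::vector<std::string>& grid, Cell from, char target) {
    int best = std::numeric_limits<int>::max();
    for (int r = 0; r < static_cast<int>(grid.size()); ++r)
        for (int c = 0; c < static_cast<int>(grid[r].size()); ++c)
            if (grid[r][c] == target)
                best = std::min(best, std::abs(r - from.row) + std::abs(c - from.col));
    return best == std::numeric_limits<int>::max() ? -1 : best;
}

// Feature vector {dot feature, firepit feature} for an agent standing at `at`.
std::array<double, 2> features(const std::vector<std::string>& grid, Cell at) {
    int dDot = nearestDistance(grid, at, '.');
    int dFire = nearestDistance(grid, at, 'f');
    assert(dFire >= 0);  // the sketch assumes the grid has at least one firepit
    // Assumption: no dots left gives a feature of 0; distance 0 is clamped to 1.
    int d = std::max(dDot, 1);
    double dotFeature = dDot < 0 ? 0.0 : 1.0 / (d * d);
    return {dotFeature, dFire + 1.0};
}

// Linear approximation: Q = w[0] * f[0] + w[1] * f[1].
double qValue(const std::array<double, 2>& w, const std::array<double, 2>& f) {
    return w[0] * f[0] + w[1] * f[1];
}
```

For the starting cell (row 0, column 3) of the grid above, this sketch gives the features {1, 2}: the nearest dot and the nearest firepit are both one step away.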
I have tried changing the rewards that the environment gives and also the initial weights, but neither has fixed the problem. Here is the link to the entire program: https://repl.it/repls/WrongCheeryInterface
Here is what is going on in the main loop:
while(points != NUMPOINTS){
    bool playerDied = false;
    if(!start){
        if(!atFirepit()){
            r = 0;
            if(visited[player.x][player.y] == 0){
                points += 1;
                r += 500;
            }else{
                r += -50;
            }
        }else{
            playerDied = true;
            r = -100;
        }
    }
    //Update visited
    visited[player.x][player.y] = 1;
    if(!start){
        //This is based off the q learning update formula
        pairPoint qAndA = getMaxQAndAction();
        double maxQValue = qAndA.q;
        double sample = r;
        if(!playerDied && points != NUMPOINTS)
            sample = r + (gamma2 * maxQValue);
        double diff = sample - qVal;
        updateWeights(player, diff);
    }
    // checking end game condition
    if(playerDied || points == NUMPOINTS) break;
    pairPoint qAndA = getMaxQAndAction();
    qVal = qAndA.q;
    int bestAction = qAndA.a;
    //update player and q value
    player.x += dx[bestAction];
    player.y += dy[bestAction];
    start = false;
}
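For comparison, the textbook approximate Q-learning update that the loop is based on can be sketched as follows. This is a standalone illustration, not the body of `updateWeights` from the linked program: the function name `updatedWeights` and the learning rate `alpha` are assumptions (the learning rate does not appear in the excerpt), and `f` is assumed to hold the features of the state-action pair whose Q-value is being corrected.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One approximate Q-learning step for a two-feature linear Q-function:
//   sample = r                          if the episode ended
//   sample = r + gamma * max_a' Q(s',a') otherwise
//   diff   = sample - Q(s,a)
//   w_i   += alpha * diff * f_i(s,a)
std::array<double, 2> updatedWeights(std::array<double, 2> w,
                                     const std::array<double, 2>& f,
                                     double reward, double gamma,
                                     double maxNextQ, bool terminal,
                                     double alpha) {
    assert(alpha > 0.0);  // hypothetical learning rate, must be positive
    // Q(s,a) under the current weights, using the same features as the update.
    double qOld = w[0] * f[0] + w[1] * f[1];
    double sample = terminal ? reward : reward + gamma * maxNextQ;
    double diff = sample - qOld;
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] += alpha * diff * f[i];
    return w;
}
```

In this form each weight moves in proportion to its own feature value, so the features passed to the update have to be the same ones that produced the old Q-value (`qVal` in the loop above).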
I would expect both weights to end up positive, but one of them (the one for the distance to the firepit) is negative.
I also expected the program to learn over time that it is bad to enter a firepit and also bad, but not as bad, to return to an already visited position.
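To illustrate why the sign of the firepit weight matters, here is a small sketch that evaluates the linear Q-value with the weights the program converged to (-125 for the firepit feature, 25 for the dot feature). The helper `q` is hypothetical, the distances are made-up inputs, and the features are assumed to be evaluated at the cell the agent would move into.

```cpp
#include <cassert>

// Linear Q-value built from the two features described above:
//   firepit feature = (distance to firepit) + 1
//   dot feature     = 1 / (distance to closest dot)^2
double q(double wFire, double wDot, int distFire, int distDot) {
    assert(distDot > 0);  // the sketch does not handle a dot distance of 0
    double fireFeature = distFire + 1.0;
    double dotFeature = 1.0 / (distDot * distDot);
    return wFire * fireFeature + wDot * dotFeature;
}
```

With a dot one step away, `q(-125, 25, 0, 1)` is -100, `q(-125, 25, 1, 1)` is -225 and `q(-125, 25, 2, 1)` is -350. Under these assumptions, a negative firepit weight makes cells farther from the firepit look worse, so a greedy agent is drawn toward the firepit rather than away from it.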