This post is by no means a complete implementation of Q-learning in R; it is an attempt to answer the OP with regard to the description of the algorithm in the website linked in the post and in Wikipedia.

The assumption here is that the reward matrix R is as described in the website. Namely, it encodes reward values for possible actions as non-negative numbers, and the -1's in the matrix represent null values (i.e., where there is no possible action to transition to that state).
With this setup, an R implementation of the Q update is:
Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
where
cs is the current state at the current point in the path, and
ns is the new state based on a (randomly) chosen action at the current state. This action is chosen from the collection of possible actions at the current state (i.e., those for which R[cs,] > -1). Since the state transition itself is deterministic here, the action is simply the transition to the new state.
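One small R detail worth flagging here: sample() treats a single integer n as shorthand for 1:n, so when only one action is available the index has to be used directly rather than passed to sample(). A minimal sketch of the action-selection step, assuming R is the reward matrix defined further below and using cs <- 3 purely as an illustrative current state (the same logic appears inside the full function below):

cs <- 3                              # illustrative current state
next.states <- which(R[cs,] > -1)    # indices of valid actions/transitions from cs
if (length(next.states) == 1) {
  ns <- next.states                  # sample(n, 1) would draw from 1:n, so avoid it here
} else {
  ns <- sample(next.states, 1)       # pick one valid transition uniformly at random
}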
For the action resulting in ns, we want to add its maximum (future) value over all possible actions that can be taken at ns. This is the so-called Max[Q(next state, all actions)] term in the linked website and the "estimate of optimal future value" in Wikipedia. To compute it, we maximize over the ns-th row of Q, but consider only those columns whose entries in the ns-th row of R correspond to valid actions (i.e., for which R[ns,] > -1). This term is therefore:
max(Q[ns, which(R[ns,] > -1)])
One interpretation of this value is a one-step look-ahead value, or an estimate of the cost-to-go in dynamic programming.
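To make the masking concrete, here is an isolated sketch with a single made-up row of R and the corresponding row of Q (the numbers are illustrative only and do not come from the example below):

r.row <- c(-1, 0, -1, 100)      # valid actions at ns are columns 2 and 4
q.row <- c(37, 5, 99, 80)       # hypothetical Q values for the ns-th row
max(q.row[which(r.row > -1)])   # 80, not 99: column 3 is not a valid action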
The equation in the linked website is the special case in which alpha, the learning rate, is 1. We can view the equation in Wikipedia as:
Q[cs,ns] <- (1-alpha)*Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]))
where alpha "interpolates" between the old value Q[cs,ns] and the learned value R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]). As noted in Wikipedia:

"In fully deterministic environments, a learning rate of alpha=1 is optimal."
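To see that the incremental form and the interpolation form agree, here is a small hand-worked sketch of a single update using made-up numbers (old.q, reward, and max.q are illustrative placeholders, not values from the example below):

alpha <- 0.5; gamma <- 0.8
old.q  <- 40      # current Q[cs,ns]
reward <- 0       # R[cs,ns]
max.q  <- 100     # max(Q[ns, which(R[ns,] > -1)])

learned <- reward + gamma*max.q          # 80
old.q + alpha*(learned - old.q)          # incremental form:   60
(1-alpha)*old.q + alpha*learned          # interpolation form: 60
## with alpha = 1, both forms reduce to the learned value, 80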
Putting it all together into a function:
q.learn <- function(R, N, alpha, gamma, tgt.state) {
  ## initialize Q to a zero matrix of the same size as R
  Q <- matrix(rep(0, length(R)), nrow=nrow(R))
  ## loop over episodes
  for (i in 1:N) {
    ## for each episode, choose an initial state at random
    cs <- sample(1:nrow(R), 1)
    ## iterate until we get to the tgt.state
    while (TRUE) {
      ## choose the next state from the possible actions at the current state;
      ## if there is only one possible action take it, otherwise pick one at random
      next.states <- which(R[cs,] > -1)
      if (length(next.states) == 1) {
        ns <- next.states
      } else {
        ns <- sample(next.states, 1)
      }
      ## this is the Q-learning update
      Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
      ## break out of the while loop if the target state is reached;
      ## otherwise, make the new state the current state and repeat
      if (ns == tgt.state) break
      cs <- ns
    }
  }
  ## return the resulting Q normalized by its maximum value
  return(100*Q/max(Q))
}
where the input parameters are:
R is the rewards matrix as defined in the linked website,
N is the number of episodes to iterate,
alpha is the learning rate,
gamma is the discount factor, and
tgt.state is the target state of the problem.
Using the example in the linked website as a test:
N <- 1000
alpha <- 1
gamma <- 0.8
tgt.state <- 6
R <- matrix(c(-1,-1,-1,-1,0,-1, -1,-1,-1,0,-1,0, -1,-1,-1,0,-1,-1, -1,0,0,-1,0,-1, 0,-1,-1,0,-1,0, -1,100,-1,-1,100,100), nrow=6)
print(R)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   -1   -1   -1   -1    0   -1
## [2,]   -1   -1   -1    0   -1  100
## [3,]   -1   -1   -1    0   -1   -1
## [4,]   -1    0    0   -1    0   -1
## [5,]    0   -1   -1    0   -1  100
## [6,]   -1    0   -1   -1    0  100
Q <- q.learn(R, N, alpha, gamma, tgt.state)
print(Q)
##      [,1] [,2] [,3] [,4] [,5]      [,6]
## [1,]    0    0  0.0    0   80   0.00000
## [2,]    0    0  0.0   64    0 100.00000
## [3,]    0    0  0.0   64    0   0.00000
## [4,]    0   80 51.2    0   80   0.00000
## [5,]   64    0  0.0   64    0 100.00000
## [6,]    0   80  0.0    0   80  99.99994
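As a quick check that the learned Q makes sense, one can follow the highest-valued action from any starting state until the target is reached. The helper below is a hypothetical addition for illustration (greedy.path is not part of the answer's code); with the Q printed above, starting from state 3 it yields one of the optimal routes in the linked example:

## hypothetical helper: greedily follow the largest Q entry in each row
## (assumes the learned Q actually leads to tgt.state from 'start')
greedy.path <- function(Q, start, tgt.state) {
  path <- start
  cs <- start
  while (cs != tgt.state) {
    cs <- which.max(Q[cs,])   # ties are broken by taking the first maximum
    path <- c(path, cs)
  }
  path
}
greedy.path(Q, 3, 6)
## [1] 3 4 2 6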