This post is by no means a complete implementation of Q-learning in R; it is an attempt to answer the OP with regard to the description of the algorithm in the website linked in the post and in Wikipedia.

The assumption here is that the reward matrix R is as described in the website. Namely, it encodes reward values for possible actions as non-negative numbers, and the -1's in the matrix represent null values (i.e., where there is no possible action to transition to that state).
With this setup, an R implementation of the Q update is:
Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
where
cs is the current state at the current point in the path, and
ns is the new state based on a (randomly) chosen action at the current state. This action is chosen from the collection of possible actions at the current state (i.e., those for which R[cs,] > -1). Since the state transition itself is deterministic here, the action is simply the transition to the new state.
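One small R detail worth flagging here: sample() treats a single integer n as shorthand for 1:n, so when only one action is available the index has to be used directly rather than passed to sample(). A minimal sketch of the action-selection step, assuming R is the reward matrix defined further below and using cs <- 3 purely as an illustrative current state (the same logic appears inside the full function below):

cs <- 3                              # illustrative current state
next.states <- which(R[cs,] > -1)    # indices of valid actions/transitions from cs
if (length(next.states) == 1) {
  ns <- next.states                  # sample(n, 1) would draw from 1:n, so avoid it here
} else {
  ns <- sample(next.states, 1)       # pick one valid transition uniformly at random
}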
For the action resulting in ns, we want to add its maximum (future) value over all possible actions that can be taken at ns. This is the so-called Max[Q(next state, all actions)] term in the linked website and the "estimate of optimal future value" in Wikipedia. To compute it, we maximize over the ns-th row of Q, but consider only those columns whose entries in the ns-th row of R correspond to valid actions (i.e., for which R[ns,] > -1). This term is therefore:
max(Q[ns, which(R[ns,] > -1)])
One interpretation of this value is a one-step look-ahead value, or an estimate of the cost-to-go in dynamic programming.
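To make the masking concrete, here is an isolated sketch with a single made-up row of R and the corresponding row of Q (the numbers are illustrative only and do not come from the example below):

r.row <- c(-1, 0, -1, 100)      # valid actions at ns are columns 2 and 4
q.row <- c(37, 5, 99, 80)       # hypothetical Q values for the ns-th row
max(q.row[which(r.row > -1)])   # 80, not 99: column 3 is not a valid action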
The equation in the linked website is the special case in which alpha, the learning rate, is 1. We can view the equation in Wikipedia as:
Q[cs,ns] <- (1-alpha)*Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]))
where alpha "interpolates" between the old value Q[cs,ns] and the learned value R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]). As noted in Wikipedia:

"In fully deterministic environments, a learning rate of alpha=1 is optimal."
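To see that the incremental form and the interpolation form agree, here is a small hand-worked sketch of a single update using made-up numbers (old.q, reward, and max.q are illustrative placeholders, not values from the example below):

alpha <- 0.5; gamma <- 0.8
old.q  <- 40      # current Q[cs,ns]
reward <- 0       # R[cs,ns]
max.q  <- 100     # max(Q[ns, which(R[ns,] > -1)])

learned <- reward + gamma*max.q          # 80
old.q + alpha*(learned - old.q)          # incremental form:   60
(1-alpha)*old.q + alpha*learned          # interpolation form: 60
## with alpha = 1, both forms reduce to the learned value, 80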
Putting it all together into a function:
q.learn <- function(R, N, alpha, gamma, tgt.state) {
  ## initialize Q to a zero matrix of the same size as R
  Q <- matrix(rep(0, length(R)), nrow=nrow(R))
  ## loop over episodes
  for (i in 1:N) {
    ## for each episode, choose an initial state at random
    cs <- sample(1:nrow(R), 1)
    ## iterate until we get to the tgt.state
    while (TRUE) {
      ## choose the next state from the possible actions at the current state;
      ## if there is only one possible action take it, otherwise pick one at random
      next.states <- which(R[cs,] > -1)
      if (length(next.states) == 1) {
        ns <- next.states
      } else {
        ns <- sample(next.states, 1)
      }
      ## this is the Q-learning update
      Q[cs,ns] <- Q[cs,ns] + alpha*(R[cs,ns] + gamma*max(Q[ns, which(R[ns,] > -1)]) - Q[cs,ns])
      ## break out of the while loop if the target state is reached;
      ## otherwise, make the new state the current state and repeat
      if (ns == tgt.state) break
      cs <- ns
    }
  }
  ## return the resulting Q normalized by its maximum value
  return(100*Q/max(Q))
}
where the input parameters are:
R is the rewards matrix as defined in the linked website,
N is the number of episodes to iterate,
alpha is the learning rate,
gamma is the discount factor, and
tgt.state is the target state of the problem.
Using the example in the linked website as a test:
N <- 1000
alpha <- 1
gamma <- 0.8
tgt.state <- 6
R <- matrix(c(-1,-1,-1,-1,0,-1, -1,-1,-1,0,-1,0, -1,-1,-1,0,-1,-1, -1,0,0,-1,0,-1, 0,-1,-1,0,-1,0, -1,100,-1,-1,100,100), nrow=6)
print(R)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   -1   -1   -1   -1    0   -1
## [2,]   -1   -1   -1    0   -1  100
## [3,]   -1   -1   -1    0   -1   -1
## [4,]   -1    0    0   -1    0   -1
## [5,]    0   -1   -1    0   -1  100
## [6,]   -1    0   -1   -1    0  100
Q <- q.learn(R, N, alpha, gamma, tgt.state)
print(Q)
##      [,1] [,2] [,3] [,4] [,5]      [,6]
## [1,]    0    0  0.0    0   80   0.00000
## [2,]    0    0  0.0   64    0 100.00000
## [3,]    0    0  0.0   64    0   0.00000
## [4,]    0   80 51.2    0   80   0.00000
## [5,]   64    0  0.0   64    0 100.00000
## [6,]    0   80  0.0    0   80  99.99994
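As a quick check that the learned Q makes sense, one can follow the highest-valued action from any starting state until the target is reached. The helper below is a hypothetical addition for illustration (greedy.path is not part of the answer's code); with the Q printed above, starting from state 3 it yields one of the optimal routes in the linked example:

## hypothetical helper: greedily follow the largest Q entry in each row
## (assumes the learned Q actually leads to tgt.state from 'start')
greedy.path <- function(Q, start, tgt.state) {
  path <- start
  cs <- start
  while (cs != tgt.state) {
    cs <- which.max(Q[cs,])   # ties are broken by taking the first maximum
    path <- c(path, cs)
  }
  path
}
greedy.path(Q, 3, 6)
## [1] 3 4 2 6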