6

I'm using Sutton & Barto's ebook Reinforcement Learning: An Introduction to study reinforcement learning. I'm having some trouble reproducing the results (plots) from the action-value methods page.

More specifically, how can I simulate the greedy value for each task? The book says:

...we can plot the performance and behavior of various methods as they improve with experience over 1000 plays...

So I guess I have to keep track of the exploratory values as better ones are found. The issue is how to do this with the greedy approach: since there are no exploratory moves, how do I know what counts as greedy behavior?

Thanks for all the comments and answers!

UPDATE: See the code in my answer below.

fabrizioM
Fernando
  • What output are you getting, and what do you expect to be getting? (Also, you might want to add a `seed()` so that others can replicate.) – Ricardo Saporta Jul 29 '13 at 21:39
  • The problem is to get the *greedy* value, given a vector V (each column of x). If runif(1) < eps, just pick a random value from V; otherwise pick the *greedy* value of V, which I think is the mean, but the plots don't look right (see the short sketch after these comments). – Fernando Jul 29 '13 at 21:45
  • Perhaps modify the code so that the plots are generated simply by copying and pasting your code. When I copy and paste your code nothing is returned. I have now gotten the first two functions to return data, but have not yet gotten the plots generated. – Mark Miller Jul 29 '13 at 21:55
  • I might also reduce `n` and `play` until the code is working as desired. – Mark Miller Jul 29 '13 at 21:58
  • Something is odd here... every row in `rewards.greedy` is going to be identical, since each element is going to be the `max` of its respective column counterpart in `x`. How is greedy supposed to be selected? – Ricardo Saporta Jul 30 '13 at 02:23
  • That's the problem, the book doesn't show the details. – Fernando Jul 30 '13 at 02:28
  • Fernando, if `arms` is the number of actions to choose from, then what does `n` represent? What is the relationship between `n` & `arms` (or `n` and `plays`)? – Ricardo Saporta Jul 30 '13 at 02:41
  • *n* is the number of simulations. The code generates *n* simulations, each one has *arms* tasks. Each simulation will be 'played' *play* times. There's no relation between *n* and *play* as far as I know. – Fernando Jul 30 '13 at 02:47
  • So then is it fair to assume that for each `n` (for each simulation), the rewards should be averaged by `arm`? – Ricardo Saporta Jul 30 '13 at 02:54
  • The rewards should be averaged by play. Each simulation output (average reward) is the result of a play action (look at the *apply* call in the *run.simulation* function). – Fernando Jul 30 '13 at 02:59
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/34412/discussion-between-ricardo-saporta-and-fernando) – Ricardo Saporta Jul 30 '13 at 03:01
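
To make the point under discussion concrete: the greedy choice is not the mean of the reward vector, but the arm whose running estimate is currently highest. A minimal sketch of the selection rule (my own illustration, not code from the book or the answers below):

# Given a vector Q of running per-arm value estimates, pick a random arm with
# probability eps; otherwise pick the arm with the largest current estimate
# (the greedy choice). eps = 0 gives the purely greedy player.
choose.action = function(Q, eps = 0.1) {
  if (runif(1) < eps) sample(length(Q), 1) else which.max(Q)
}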

4 Answers

5

I finally got this right. The eps player should beat the greedy player because of the exploratory moves, as pointed out in the book. The code is slow and needs some optimization, but here it is:

[Plot: average reward and optimal-move percentage over plays, for eps = 0, 0.01 and 0.1]

get.testbed = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1){

  optimal = rnorm(arms, u, sdev.arm)
  rewards = sapply(optimal, function(x)rnorm(plays, x, sdev.rewards))

  list(optimal = optimal, rewards = rewards)
}

play.slots = function(arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = 0.1){

  testbed = get.testbed(arms, plays, u, sdev.arm, sdev.rewards)
  optimal = testbed$optimal
  rewards = testbed$rewards

  optim.index = which.max(optimal)
  slot.rewards = rep(0, arms)
  reward.hist = rep(0, plays)
  optimal.hist = rep(0, plays)
  pulls = rep(0, arms)
  probs = runif(plays)

  # TODO: vectorize this loop
  for (i in 1:plays){

      ## don't use ifelse() in this case
      ## idx = ifelse(probs[i] < eps, sample(arms, 1), which.max(slot.rewards))

      idx = if (probs[i] < eps) sample(arms, 1) else which.max(slot.rewards)
      reward.hist[i] = rewards[i, idx]

      if (idx == optim.index)
        optimal.hist[i] = 1

      slot.rewards[idx] = slot.rewards[idx] + (rewards[i, idx] - slot.rewards[idx])/(pulls[idx] + 1)
      pulls[idx] = pulls[idx] + 1
  }

  list(slot.rewards = slot.rewards, reward.hist = reward.hist, optimal.hist = optimal.hist, pulls = pulls)
}

do.simulation = function(N = 100, arms = 10, plays = 500, u = 0, sdev.arm = 1, sdev.rewards = 1, eps = c(0.0, 0.01, 0.1)){

  n.players = length(eps)
  col.names = paste('eps', eps)
  rewards.hist = matrix(0, nrow = plays, ncol = n.players)
  optim.hist = matrix(0, nrow = plays, ncol = n.players)
  colnames(rewards.hist) = col.names
  colnames(optim.hist) = col.names

  for (p in 1:n.players){
    for (i in 1:N){
      play.results = play.slots(arms, plays, u, sdev.arm, sdev.rewards, eps[p])
      rewards.hist[, p] = rewards.hist[, p] + play.results$reward.hist
      optim.hist[, p] = optim.hist[, p] + play.results$optimal.hist
    } 
  }

  rewards.hist = rewards.hist/N
  optim.hist = optim.hist/N
  optim.hist = apply(optim.hist, 2, function(x)cumsum(x)/(1:plays))

  ### Plot helper ###
  plot.result = function(x, n.series, colors, leg.names, ...){
    for (i in 1:n.series){
      if (i == 1)
        plot.ts(x[, i], ylim = 2*range(x), col = colors[i], ...)
      else
        lines(x[, i], col = colors[i], ...)
      grid(col = 'lightgray')
    }
    legend('topleft', leg.names, col = colors, lwd = 2, cex = 0.6, box.lwd = NA)
  }
  ### Plot helper ###

  #### Plots ####
  require(RColorBrewer)
  colors = brewer.pal(n.players + 3, 'Set2')
  op <-par(mfrow = c(2, 1), no.readonly = TRUE)

  plot.result(rewards.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Average reward', lwd = 2)
  plot.result(optim.hist, n.players, colors, col.names, xlab = 'Plays', ylab = 'Optimal move %', lwd = 2)
  #### Plots ####

  par(op)
}

To run it, just call:

do.simulation(N = 100, arms = 10, eps = c(0, 0.01, 0.1))
Fernando
  • This is very nice. Would you be willing to post code that compares three strategies, as in the book? Maybe I can go through this later and figure out how to add a third strategy. Thanks either way for posting this. – Mark Miller Aug 02 '13 at 11:04
  • Very nice code indeed. Just a minor comment regarding the line `idx = if (probs[i] < eps) sample(arms, 1) else which.max(slot.rewards)`: when `slot.rewards = rep(0, arms)`, `which.max(slot.rewards)` is systematically 1, whereas it is better (to avoid anchoring on the first arm) to draw an arm at random in that case (see the sketch just below). – Bertrand Nov 04 '20 at 16:44
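
A minimal sketch of Bertrand's suggestion (my own addition, not part of the original answer): break ties among equally valued arms at random, instead of always taking the first index that which.max() returns:

# Random tie-breaking: among the arms whose estimates equal the current maximum,
# pick one uniformly at random. The length-1 guard is needed because
# sample(n, 1) on a single integer n would draw from 1:n instead.
which.max.random = function(x) {
  best = which(x == max(x))
  if (length(best) == 1) best else sample(best, 1)
}

# In play.slots(), the selection line would then become:
# idx = if (probs[i] < eps) sample(arms, 1) else which.max.random(slot.rewards)
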
2

You could also choose to make use of the R package "contextual", which aims to ease the implementation and evaluation of both context-free (as described in Sutton & Barto) and contextual (for example LinUCB) multi-armed bandit policies.

The package actually offers a vignette on how to replicate all Sutton & Barto bandit plots. For example, to generate the ε-greedy plots, just simulate EpsilonGreedy policies against a Gaussian bandit:

library(contextual)

set.seed(2)
mus             <- rnorm(10, 0, 1)
sigmas          <- rep(1, 10)
bandit          <- BasicGaussianBandit$new(mu_per_arm = mus, sigma_per_arm = sigmas)

agents          <- list(Agent$new(EpsilonGreedyPolicy$new(0),    bandit, "e = 0, greedy"),
                        Agent$new(EpsilonGreedyPolicy$new(0.1),  bandit, "e = 0.1"),
                        Agent$new(EpsilonGreedyPolicy$new(0.01), bandit, "e = 0.01"))

simulator       <- Simulator$new(agents = agents, horizon = 1000, simulations = 2000)
history         <- simulator$run()

plot(history, type = "average", regret = FALSE, lwd = 1, legend_position = "bottomright")
plot(history, type = "optimal", lwd = 1, legend_position = "bottomright")

[Plot: EpsilonGreedy, average rewards]

[Plot: EpsilonGreedy, optimal arms]

Full disclosure: I am one of the developers of the package.

1

This is what I have so far, based on our chat:

set.seed(1)

getRewardsGaussian <- function(arms, plays) {
## assuming each action has a normal distribution 

  # first generate new means
  QStar <- rnorm(arms, 0, 1)

  # then for each mean, generate `play`-many samples
  sapply(QStar, function(u)
    rnorm(plays, u, 1))
}


CalculateRewardsPerMethod <- function(arms=7, epsi1=0.01, epsi2=0.1
                    , plays=1000, methods=c("greedy", "epsi1", "epsi2")) {

  # names for easy handling
  names(methods) <- methods
  arm.names <- paste0("Arm", ifelse((1:arms)<10, 0, ""), 1:arms)

  # this could be different if not all actions' rewards have a gaussian dist.
  rewards.source <- getRewardsGaussian(arms, plays) 

  # Three dimensional array to track running averages of each method
  running.avgs <- 
    array(0, dim=c(plays, arms, length(methods))
           , dimnames=list(PlayNo.=NULL, Arm=arm.names, Method=methods))

  # Three dimensional array to track the outcome of each play, according to each method 
  rewards.received <- 
    array(NA_real_, dim=c(plays, 2, length(methods))
                  , dimnames=list(PlayNo.=seq(plays), Outcome=c("Arm", "Reward"), Method=methods))


  # define the function internally to not have to pass running.avgs 
  chooseAnArm <- function(p) {
    # Note that in a tie, which.max returns the lowest (i.e., first) index, which is what we want
    maxes <- apply(running.avgs[p, ,methods, drop=FALSE], 3, which.max)

    # Note: deliberately drawing two separate random numbers and keeping this as 
    #       two lines of code to accent that the two draws should not be related 
    if(runif(1) < epsi1)
      maxes["epsi1"] <- sample(arms, 1)

    if(runif(1) < epsi2)
      maxes["epsi2"] <- sample(arms, 1)

    return(maxes)
  }

  ## TODO:  Perform each action at least once, then select according to algorithm
  ## Starting points. Everyone starts at machine 3
  choice <- c(3, 3, 3)
  reward <- rewards.source[1, choice]
  ## First run, slightly different
  rewards.received[1,,] <- rbind(choice, reward)
  running.avgs[1, choice, ] <- reward # if different starting points, this needs to change like below

  ## HERE IS WHERE WE START PULLING THE LEVERS ##
  ## ----------------------------------------- ##
  for (p in 2:plays) {
    choice <- chooseAnArm(p)
    reward <- rewards.source[p, choice]

    # Note: When dropping a dim, the methods will be the columns 
    #       and the Outcome info will be the rows. Use `rbind` instead of `cbind`.
    rewards.received[p,,names(choice)] <- rbind(choice, reward)

    ## Update the running averages. 
    ## For each method, the current running averages are the same as the
    ##    previous for all arms, except for the one chosen this round.
    ##    Thus start with last round's averages, then update the one arm.
    running.avgs[p,,] <- running.avgs[p-1,,]

    # The updating is the only involved part (due to lots of array indexing)
    running.avgs[p,,][cbind(choice, 1:3)] <- 
     sapply(names(choice), function(m) 
       # Update the running average for the selected arm (for the current play & method) 
          mean( rewards.received[ 1:p,,,drop=FALSE][ rewards.received[1:p,"Arm",m] == choice[m],"Reward",m])
     )
  } # end for-loop


  ## DIFFERENT RETURN OPTIONS ##
  ## ------------------------ ##


  ## All rewards received, in simplified matrix (dropping information on arm chosen)
  # return(rewards.received[, "Reward", ])

  ## All rewards received, along with which arm chosen: 
  #   return(rewards.received)

  ## Running averages of the rewards received by method
  return( apply(rewards.received[, "Reward", ], 2, cumsum) / (1:plays) )

}


### EXECUTION (AND SIMULATION)

## PARAMETERS
arms   <- 10
plays  <- 1000
epsi1  <- 0.01
epsi2  <- 0.1
simuls <- 50  # 2000
methods <- c("greedy", "epsi1", "epsi2")

## Single Iteration: 
### we can run system time to get an idea for how long one will take
tme <- system.time( CalculateRewardsPerMethod(arms=arms, epsi1=epsi1, epsi2=epsi2, plays=plays) )
cat("Expected run time is approx: ", round((simuls * tme[["elapsed"]]) / 60, 1), " minutes")

## Multiple iterations (simulations)
rewards.received.list <- replicate(simuls, CalculateRewardsPerMethod(arms=arms, epsi1=epsi1, epsi2=epsi2, plays=plays), simplify="array")

## Compute average across simulations
rewards.received <- apply(rewards.received.list, 1:2, mean)

## RESULTS
head(rewards.received, 17)
MeanRewards <- rewards.received

## If using an alternate return method in `Calculate..` use the two lines below to calculate running avg
#   CumulRewards <- apply(rewards.received, 2, cumsum)
#   MeanRewards  <- CumulRewards / (1:plays)

## PLOT
plot.ts(MeanRewards[, "greedy"], col = 'red', lwd = 2, ylim = range(MeanRewards), ylab = 'Average reward', xlab="Plays")
  lines(MeanRewards[, "epsi1"], col = 'orange', lwd = 2)
  lines(MeanRewards[, "epsi2"], col = 'navy', lwd = 2)
  grid(col = 'darkgray')

  legend('bottomright', c('greedy', paste("epsi1 =", epsi1), paste("epsi2 =", epsi2)), col = c('red', 'orange', 'navy'), lwd = 2, cex = 0.8)

[Plot: running average reward over plays for the greedy, epsi1 and epsi2 methods]

Ricardo Saporta
  • If you know Lisp, I think the Lisp code is here: http://webdocs.cs.ualberta.ca/~sutton/book/code/testbed.lisp Maybe it can be translated into R. The R Package `bandit` might help, but I do not know. – Mark Miller Jul 30 '13 at 07:32
  • Thanks Mark. I do not know lisp, but I will check that out – Ricardo Saporta Jul 30 '13 at 07:57
  • Testing your code, it looks nice. I updated mine; the only issue now is how to 'update' the greedy value when eps = 0 (100% greedy). In the other cases, it works OK. – Fernando Jul 30 '13 at 15:59
  • The greedy method is always included. It updates in the line ` running.avgs[p,,][cbind(choice, 1:3)] <- .....` (which is then used in the function `ChooseAnArm`) – Ricardo Saporta Jul 30 '13 at 16:30
  • I still don't get how to 'play greedy'. My greedy curve is always 0. – Fernando Jul 30 '13 at 18:16
  • "greedy" is the following: At play p (say p = 137), for each arm a, calculate the average payout for each `a`. This is the Total rewards from _just that arm_ divided by the number of times that arm has been pulled. (Notice that all of these pieces of information have to be tracked. The "greedy" move is: at each point p, chose the arm with the highest average reward at that point. The other thing to notice is that this will of course change over time. Arm 3 may have a higher reward in general, but maybe Arm 7 is having a hot run and only giving winners. – Ricardo Saporta Jul 30 '13 at 18:21
  • Cool, I'll try to update my code. But why does your plot scale seem different from the book? – Fernando Jul 30 '13 at 18:24
  • I'm not clear on which calculation the book uses to compute the average over tasks. I ran all simulations then averaged at the end. If you have any clarification on this, I'd be curious as to your take – Ricardo Saporta Jul 30 '13 at 18:26
  • @Fernando, very cool! I will go through it later when I have some more free time – Ricardo Saporta Jul 31 '13 at 16:31
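
To make the bookkeeping described above concrete, here is a minimal sketch (my own addition, not part of this answer's code) of the per-arm state a greedy player needs and its incremental update after each observed reward:

# Per-arm state: pull counts and running sample-average estimates Q.
# After pulling `arm` and observing `reward`, only that arm's estimate changes.
update.estimate = function(Q, pulls, arm, reward) {
  pulls[arm] = pulls[arm] + 1
  Q[arm] = Q[arm] + (reward - Q[arm]) / pulls[arm]  # incremental sample average
  list(Q = Q, pulls = pulls)
}

# The greedy move at any play is then simply which.max(Q).
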
-1

You may also want to check this link https://www.datahubbs.com/multi_armed_bandits_reinforcement_learning_1/

Below is a copy of the relevant code from the above source. It does not use R, just NumPy (e.g. np.random.rand()).

import numpy as np

class eps_bandit:
    '''
    epsilon-greedy k-bandit problem

    Inputs
    =====================================================
    k: number of arms (int)
    eps: probability of random action 0 < eps < 1 (float)
    iters: number of steps (int)
    mu: set the average rewards for each of the k-arms.
        Set to "random" for the rewards to be selected from
        a normal distribution with mean = 0.
        Set to "sequence" for the means to be ordered from
        0 to k-1.
        Pass a list or array of length = k for user-defined
        values.
    '''

    def __init__(self, k, eps, iters, mu='random'):
        # Number of arms
        self.k = k
        # Search probability
        self.eps = eps
        # Number of iterations
        self.iters = iters
        # Step count
        self.n = 0
        # Step count for each arm
        self.k_n = np.zeros(k)
        # Total mean reward
        self.mean_reward = 0
        self.reward = np.zeros(iters)
        # Mean reward for each arm
        self.k_reward = np.zeros(k)

        if type(mu) == list or type(mu).__module__ == np.__name__:
            # User-defined averages
            self.mu = np.array(mu)
        elif mu == 'random':
            # Draw means from probability distribution
            self.mu = np.random.normal(0, 1, k)
        elif mu == 'sequence':
            # Increase the mean for each arm by one
            self.mu = np.linspace(0, k-1, k)

    def pull(self):
        # Generate random number
        p = np.random.rand()
        if self.eps == 0 and self.n == 0:
            a = np.random.choice(self.k)
        elif p < self.eps:
            # Randomly select an action
            a = np.random.choice(self.k)
        else:
            # Take greedy action
            a = np.argmax(self.k_reward)

        reward = np.random.normal(self.mu[a], 1)

        # Update counts
        self.n += 1
        self.k_n[a] += 1

        # Update total
        self.mean_reward = self.mean_reward + (
            reward - self.mean_reward) / self.n

        # Update results for a_k
        self.k_reward[a] = self.k_reward[a] + (
            reward - self.k_reward[a]) / self.k_n[a]

    def run(self):
        for i in range(self.iters):
            self.pull()
            self.reward[i] = self.mean_reward

    def reset(self):
        # Resets results while keeping settings
        self.n = 0
        self.k_n = np.zeros(self.k)
        self.mean_reward = 0
        self.reward = np.zeros(self.iters)
        self.k_reward = np.zeros(self.k)
Nikos Tsagkas
  • A link to a solution is welcome, but please ensure your answer is useful without it: [add context around the link](//meta.stackexchange.com/a/8259) so your fellow users will have some idea what it is and why it’s there, then quote the most relevant part of the page you're linking to in case the target page is unavailable. [Answers that are little more than a link may be deleted.](//stackoverflow.com/help/deleted-answers) – Baum mit Augen Nov 12 '17 at 21:33