6

I'm looking for a more efficient way to sample from a list of integers 1:n, multiple times, where the probability vector (also length n) is different each time. For 20 trials with n = 10, I know one can do it like this:

probs <- matrix(runif(200), nrow = 20)
answers <- numeric(20)
for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,])

But that calls sample 10 times just to get a single number each time, so it's is presumably not the fastest way. Speed would be helpful as the code will be doing this plenty of times.

Many thanks!

Luke

Edit: Big thanks to Roman, whose idea about benchmarking helped me find a good solution. I've now moved this to the answer.

AeroX
  • 3,387
  • 2
  • 25
  • 39
lukeholman
  • 595
  • 6
  • 14
  • 1
    +1 You should add your random roll answer as a solution. It's a pretty cool approach! Have you checked how scaleable it is? – Simon O'Hanlon May 18 '13 at 09:11
  • It's important to note that the `prob` argument in R `sample` function *without replacement* is NOT proportional to the first order inclusion probabilities. If you want to preserve this, then check out package `sampling` @ CRAN. – Ferdinand.kraft May 18 '13 at 21:14
  • Thanks for the input guys. Ferdinand, you lost me a little bit there, but I guess in this example it doesn't matter because the sample is of length 1 (so sampling with and without replacement is the same). Also the solution in luke2 avoids sample altogether. I'll list it as a solution. – lukeholman May 20 '13 at 06:10
  • @lukeholman, you're right, my comment doesn't apply for a sample of size one. But I'll leave it there in case someone wants to use the functions here in a more general situation. – Ferdinand.kraft May 21 '13 at 01:13

2 Answers2

2

Just for fun, I tried two more versions. On what scale are you doing this sampling? I think all of these are pretty fast and more or less equivalent (I haven't included the creation of probs for your solution). Would love to see others take a shot at this.

library(rbenchmark)
benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))

    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000    0.41    1.000      0.42        0         NA        NA
2  roman         1000    0.47    1.146      0.46        0         NA        NA
3 roman2         1000    0.47    1.146      0.44        0         NA        NA
Roman Luštrik
  • 69,533
  • 24
  • 154
  • 197
1

Here's another approach that I found. It's fast, but not as fast as simply calling sample many times with a for loop. I initially thought it was very good, but I was using benchmark() incorrectly.

luke2 = function(probs) { # takes a matrix of probability vectors, each in its own row
                probs <- probs/rowSums(probs) 
                probs <- t(apply(probs,1,cumsum)) 
                answer <- rowSums(probs - runif(nrow(probs)) < 0) + 1 
                return(answer)  }

Here's how it works: picture the probabilities as lines of various lengths laid out on a number line from 0 to 1. The big probabilities will take up more of the number line than the small ones. You could then pick the outcome by picking a random point on the number line - the big probabilities will have more likelihood of being chosen. The advantage of this approach is that you can roll all the random numbers needed in one call of runif(), instead of calling sample over and over as in the functions luke, roman and roman2. However, it looks like the extra data processing slows it down and the costs more than offset this benefit.

library(rbenchmark)
probs <- matrix(runif(2000), ncol = 10)
answers <- numeric(200)

benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          luke2 = luke2(probs),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))
              roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
              roman2 = replicate(20, sample(10, 1, prob = runif(10))))

    test replications elapsed relative user.self sys.self user.child sys.child
    1   luke         1000   0.171    1.000     0.166    0.005          0         0
    2  luke2         1000   0.529    3.094     0.518    0.012          0         0
    3  roman         1000   1.564    9.146     1.513    0.052          0         0
    4 roman2         1000   0.225    1.316     0.213    0.012          0         0

For some reason, apply() does very badly as you add more rows. I don't understand why, because I thought it was a wrapper for for() and should therefore roman() should perform similarly to luke().

lukeholman
  • 595
  • 6
  • 14
  • `luke2` is not being called. the third argument to `benchmark` merely *defines* a function, it doesn't execute it. You should define the function outside the `benchmark` call and use instead something like `luke2=luke2(probs),roman=...`. – Ferdinand.kraft May 21 '13 at 01:11
  • Doh, thanks for that. I see the difference now between how I was doing and how roman was using his. Turns out it's not so good after all! I still feel like there must be a better solution out there - call sample over and over can't be the best way. – lukeholman May 24 '13 at 07:36