3

I am trying to simulate the Chinese Restaurant process in R, and wondering if I can make any efficiency improvements over this crude implementation.

iTables = 200  # number of tables
iSampleSize = 1000  # number of diners

# initialize the list of tables
listTableOccupants = vector('list', iTables)

for(currentDiner in seq.int(iSampleSize)) {
  # occupation probabilities for the next diner
  vProbabilities = sapply(listTableOccupants, 
                          function(x) ifelse(!is.null(x),
                                             length(x)/currentDiner,
                                             1/currentDiner))
  # pick the index of the lucky table
  iTable = sample.int(iTables, size = 1, prob = vProbabilities)

  # add to the list element corresponding to the table
  listTableOccupants[[iTable]] = 
    c(listTableOccupants[[iTable]], currentDiner) 
}

In particular, I am concerned about this line:

  # add to the list element corresponding to the table
  listTableOccupants[[iTable]] = 
    c(listTableOccupants[[iTable]], currentDiner) 

Is this efficient?

tchakravarty
  • 10,736
  • 12
  • 72
  • 116
  • What is the question? Why do you think you have an efficiency problem? For large datasets I'd recommend `listTableOccupants <- matrix(nr=iSampleSize, nc=iTables)` and filling the designated slot `listTableOccupants[currentDiner,iTable]<-currentDiner` , thus avoiding the need to re-allocate space. – Carl Witthoft Nov 21 '13 at 13:32
  • @CarlWitthoft I don't know how to write stochastic process simulations efficiently, and this is more of a code review question. Also, I think that your way would require me to allocate an `iTables*iSampleSize` matrix which, depending on `iSampleSize` might be very large. Plus the data structure that I have used corresponds neatly to the mathematical notion of a partition. – tchakravarty Nov 22 '13 at 01:35

1 Answers1

0

To avoid space reallocation and sparse data structures, you can instead apply a table label to each diner. For example,

nDnr <- 100  # number of diners; must be at least 2
vDnrTbl <- rep(0, nDnr)  # table label for each diner
alpha <- 2  # CRP parameter

vDnrTbl[1] <- 1
for (dnr in 2:length(vDnrTbl)) {

    # compute occupation probabilities for current diner
    vOcc <- table(vDnrTbl[1:(dnr-1)])
    vProb <- c(vOcc, alpha) / (dnr - 1 + alpha)

    # add table label to diner
    nTbl <- as.numeric(names(vOcc)[length(vOcc)])  # avoid overhead of finding max of possibly large vector
    vDnrTbl[dnr] <- sample.int(nTbl+1, size=1, prob=vProb)
}

From vDnrTbl, you can obtain listTableOccupants:

nTbl <- max(c(nTbl, vDnrTbl[dnr]))
listTableOccupants <- lapply(1:nTbl, function(t) which(vDnrTbl == t))