
I am running into some memory allocation problems trying to replicate some data by groups using data.table and rep.

Here is some sample data:

ob1 <- as.data.frame(cbind(c(1999),c("THE","BLACK","DOG","JUMPED","OVER","RED","FENCE"),c(4)),stringsAsFactors=FALSE)
ob2 <- as.data.frame(cbind(c(2000),c("I","WALKED","THE","BLACK","DOG"),c(3)),stringsAsFactors=FALSE)
ob3 <- as.data.frame(cbind(c(2001),c("SHE","PAINTED","THE","RED","FENCE"),c(1)),stringsAsFactors=FALSE)
ob4 <- as.data.frame(cbind(c(2002),c("THE","YELLOW","HOUSE","HAS","BLACK","DOG","AND","RED","FENCE"),c(2)),stringsAsFactors=FALSE)
sample_data <- rbind(ob1,ob2,ob3,ob4)
colnames(sample_data) <- c("yr","token","multiple")

What I am trying to do is replicate the tokens (in the present order) by the multiple for each year.
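
For example, for yr 2000 (multiple = 3), the result I'm after is the whole token sequence, in order, repeated three times:

rep(c("I","WALKED","THE","BLACK","DOG"), 3)
# "I" "WALKED" "THE" "BLACK" "DOG" "I" "WALKED" "THE" "BLACK" "DOG" "I" "WALKED" "THE" "BLACK" "DOG"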

The following code works and gives me the answer I want:

library(plyr)
good_solution1 <- ddply(sample_data, "yr", function(x) data.frame(rep(x[,2],x[1,3])))

library(data.table)
good_solution2 <- data.table(sample_data)[, rep(token,unique(multiple)),by = "yr"]

The issue is that when I scale this up to 40 million+ rows, I run into memory issues with both possible solutions.

If my understanding is correct, these solutions are essentially doing an rbind, which allocates every time.

Does anyone have a better solution?

I looked at set() for data.table but was running into issues because I wanted to keep the tokens in the same order for each replication.

Brad
  • Since version 1.9.2. (on CRAN 27 Feb 2014), `data.table` has gained a new function `setDT() ` which takes a `list` or `data.frame` and changes its type by reference to `data.table`, *without any copy*. So, `setDT(sample_data)` instead of `data.table(sample_data)` may help to save memory. – Uwe Dec 03 '17 at 14:25
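
A minimal sketch of that setDT() suggestion (assuming data.table >= 1.9.2), applied to good_solution2:

library(data.table)
setDT(sample_data)   # converts the data.frame to a data.table by reference, without a copy
good_solution2 <- sample_data[, rep(token, unique(multiple)), by = "yr"]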

2 Answers


One way is:

require(data.table)
dt <- data.table(sample_data)
# multiple seems to be a character, convert to numeric
dt[, multiple := as.numeric(multiple)]
setkey(dt, "multiple")
dt[J(rep(unique(multiple), unique(multiple))), allow.cartesian=TRUE]

Everything except the last line should be straightforward. The last line performs a subset using the key column with the help of J(.). Each value in J(.) is matched against the key column, and the matching subset is returned.

That is, if you do dt[J(1)] you'll get the subset where multiple = 1. And if you look carefully, doing dt[J(rep(1,2))] gives you the same subset, but twice. Note that there's a difference between passing dt[J(1,1)] and dt[J(rep(1,2))]. The former matches the values (1,1) against the first two key columns of the data.table respectively, whereas the latter subsets by matching (1 and 1) against the first key column of the data.table.

So, if we pass the same value of the key column 2 times in J(.), the corresponding subset gets duplicated twice. We use this trick to pass 1 once, 2 twice, etc., and that's what the rep(.) part does. Here rep(.) gives 1,2,2,3,3,3,4,4,4,4.

And if the join results in more rows than max(nrow(dt), nrow(i)) (where i is the rep vector inside J(.)), you have to explicitly use allow.cartesian = TRUE to perform the join (I guess this is a new feature from data.table 1.8.8).
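
To make this concrete on the sample data (continuing from the dt built above), a small illustration:

dt[J(1)]             # the 5 rows where multiple == 1 (yr 2001)
dt[J(rep(1, 2))]     # the same 5 rows, returned twice (10 rows)
# dt[J(1, 1)] would instead try to match against two key columns,
# which errors here because dt is keyed on the single column "multiple"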


Edit: Here's some benchmarking I did on "relatively" big data. I don't see any spike in memory allocations in either method. But I've yet to find a way to monitor peak memory usage within a function in R. I am sure I've seen such a post here on SO, but it slips my mind at the moment. I'll write back again. For now, here's some test data and some preliminary results in case anyone is interested/wants to run it for themselves.

# dummy data
set.seed(45)
yr <- 1900:2013
sz <- sample(10:50, length(yr), replace = TRUE)
token <- unlist(sapply(sz, function(x) do.call(paste0, data.frame(matrix(sample(letters, x*4, replace=T), ncol=4)))))
multiple <- rep(sample(500:5000, length(yr), replace=TRUE), sz)

DF <- data.frame(yr = rep(yr, sz), 
                 token = token, 
                 multiple = multiple, stringsAsFactors=FALSE)

# Arun's solution
ARUN.DT <- function(dt) {
    setkey(dt, "multiple")
    idx <- unique(dt$multiple)
    dt[J(rep(idx,idx)), allow.cartesian=TRUE]
}

# Ricardo's solution
RICARDO.DT <- function(dt) {
    setkey(dt, "yr")
    newDT <- setkey(dt[, rep(NA, list(rows=length(token) * unique(multiple))), by=yr][, list(yr)], 'yr')
    newDT[, tokenReps := as.character(NA)]

    # Add the rep'd tokens into newDT, using recycling
    newDT[, tokenReps := dt[.(y)][, token], by=list(y=yr)]
    newDT
}

# create data.table
require(data.table)
DT <- data.table(DF)

# benchmark both versions
require(rbenchmark)
benchmark(res1 <- ARUN.DT(DT), res2 <- RICARDO.DT(DT), replications=10, order="elapsed")

#                     test replications elapsed relative user.self sys.self
# 1    res1 <- ARUN.DT(DT)           10   9.542    1.000     7.218    1.394
# 2 res2 <- RICARDO.DT(DT)           10  17.484    1.832    14.270    2.888

But as Ricardo says, speed may not matter if you run out of memory. So, in that case, there has to be a trade-off between speed and memory. What I'd like to verify is the peak memory used in both methods here, to say definitively whether using a join is better.
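
One rough way to approximate that (just a sketch, not a rigorous measurement) is to reset R's gc statistics before each call and read the "max used" column afterwards:

gc(reset = TRUE)       # reset the "max used" counters
res1 <- ARUN.DT(DT)
gc()                   # "max used" now shows the peak since the reset

gc(reset = TRUE)
res2 <- RICARDO.DT(DT)
gc()

This only tracks allocations that R's garbage collector sees, but it should be enough to compare the two approaches.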

Arun
  • Thanks! Yeah, the character issue was just me being stupid while creating this sample data. – Brad Mar 24 '13 at 01:54
  • Do you think that this approach will use less memory? I haven't implemented it for my large sample but I notice that it uses replicate as well. What would the difference in your solution and my good_solution2? – Brad Mar 24 '13 at 01:55
  • It'll subset for each value in `J(.)` and keep joining. So, if you had 10 ids with 100 rows each and multiple of 4, then you'll get 10*100*4 = 4000 rows (100 rows in each subset). – Arun Mar 24 '13 at 01:58
  • I suspect Ricardo's solution may be your answer in that case (after reading your question again). I'd just benchmark the two methods on relatively bigger data (not so big that you accidentally run out of memory and crash your R session, but big enough to clearly distinguish the fastest approach from the memory-cautious approach). – Arun Mar 24 '13 at 02:04
  • I love benchmarking, but I'm not sure if that will be a clear indication of the better approach. I think ultimately there will have to be a sacrifice of speed to accomplish the goal effectively. – Ricardo Saporta Mar 24 '13 at 03:10
  • +1 towards that silver `data.table` tag :-) for the great explanation, which is helping me learn. – Simon O'Hanlon Mar 24 '13 at 03:11
  • @RicardoSaporta, I suspect yours will be "faster" if the binding happens without preallocation. But since `data.table` spits out the error that there are 66 rows (in this case) unless you use `allow.cartesian=TRUE`, I wonder if there will be a performance difference. If there is a difference, we should be able to see it as the object is being copied every time (akin to for-loop pre-allocation, I suppose). I'll try to figure it out and get back. – Arun Mar 24 '13 at 09:06
  • @SimonO101, thanks. I'll try to make a comparison and see if it could be improved. – Arun Mar 24 '13 at 09:07
  • @Brad, I tested with a data.frame of dim=3482*3 with multiples ranging from 500 to 2000. The output was 9 million rows. And it worked fine for me. And there wasn't memory difference between Ricardo's and mine (but this is very premature as I was looking at the system monitor and not actually measuring). But for sure, `join` is faster (0.7 vs 1.9 seconds)! So, it'd be great if you could benchmark on your data (also, what's the size of your data.frame?) and report back. thanks. – Arun Mar 24 '13 at 10:57

You can try allocating the memory for all the rows first, and then populating them iteratively.
For example:

  # make sure `sample_data$multiple` is an integer
  sample_data$multiple <- as.integer(sample_data$multiple)

  # create data.table
  S <- data.table(sample_data, key='yr')

  # optionally, drop original data.frame if not needed
  rm(sample_data)

  ## Allocate the memory first
  newDT <- data.table(yr = rep(S$yr, S$multiple), key="yr")
  newDT[, tokenReps := as.character(NA)]

  # Add the rep'd tokens into newDT, using recycling
  newDT[, tokenReps := S[.(y)][, token], by=list(y=yr)]

Two notes:

(1) sample_data$multiple is currently a character vector and thus gets coerced when passed to rep (in your original example). It might be worth double-checking whether that is also the case in your real data.

(2) I used the following to determine the number of rows needed per year

S[, list(rows=length(token) * unique(multiple)), by=yr] 
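
On the sample data that works out to:

#      yr rows
# 1: 1999   28
# 2: 2000   15
# 3: 2001    5
# 4: 2002   18
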
Ricardo Saporta
  • Ricardo, it's character because he uses `cbind` and `as.data.frame` to create the data. And `cbind(.)` creates a matrix which is then wrapped around with `as.data.frame`. Since there's no `data.frame` input to `cbind` it is a matrix and so every value is converted to `character`. – Arun Mar 24 '13 at 02:03
  • Thanks Ricardo and Arun. Ricardo, you have tokenReps and tokenRep in your code, which is giving the final answer an extra column. – Brad Mar 24 '13 at 02:51
  • @RicardoSaporta, your first allocation step is just: `newDT <- data.table(yr = rep(sample_data$yr, sample_data$multiple), key="yr")`, isn't it? – Arun Mar 24 '13 at 10:43
  • @Arun, indeed! And that is quite a bit faster and cleaner. Edited my answer to reflect, thanks! – Ricardo Saporta Mar 24 '13 at 19:39