Is my way of duplicating rows in data.table efficient?

Question

I have monthly data in one data.table and annual data in another data.table and now I want to match the annual data to the respective observation in the monthly data.

My approach is as follows: Duplicating the annual data for every month and then join the monthly and annual data. And now I have a question regarding the duplication of rows. I know how to do it, but I'm not sure if it is the best way to do it, so some opinions would be great.

Here is an exemplatory data.table DT for my annual data and how I currently duplicate:

library(data.table)
DT <- data.table(ID = paste(rep(c("a", "b"), each=3), c(1:3, 1:3), sep="_"),
                    values = 10:15,
                    startMonth = seq(from=1, by=2, length=6),
                    endMonth = seq(from=3, by=3, length=6))
DT
      ID values startMonth endMonth
[1,] a_1     10          1        3
[2,] a_2     11          3        6
[3,] a_3     12          5        9
[4,] b_1     13          7       12
[5,] b_2     14          9       15
[6,] b_3     15         11       18
#1. Alternative
DT1 <- DT[, list(MONTH=startMonth:endMonth), by="ID"]
setkey(DT,  ID)
setkey(DT1, ID)
DT1[DT]
ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
[...]

The last join is exactly what I want. However, DT[, list(MONTH=startMonth:endMonth), by="ID"] already does everything I want except adding the other columns to DT, so I was wondering if I could get rid of the last three rows in my code, i.e. the setkey and join operations. It turns out, you can, just do the following:

#2. Alternative: More intuitiv and just one line of code
DT[, list(MONTH=startMonth:endMonth, values, startMonth, endMonth), by="ID"]
 ID MONTH values startMonth endMonth
a_1    1     10          1        3
a_1    2     10          1        3
a_1    3     10          1        3
a_2    3     11          3        6
...

This, however, only works because I hardcoded the column names into the list expression. In my real data, I do not know the names of all columns in advance, so I was wondering if I could just tell data.table to return the column MONTH that I compute as shown above and all the other columns of DT. .SD seemed to be able to do the trick, but:

DT[, list(MONTH=startMonth:endMonth, .SD), by="ID"]
Error in `[.data.table`(DT, , list(YEAR = startMonth:endMonth, .SD), by = "ID") : 
  maxn (4) is not exact multiple of this j column's length (3)

So to summarize, I know how it's been done, but I was just wondering if this is the best way to do it because I'm still struggling a little bit with the syntax of data.table and often read in posts and on the wiki that there are good and bads ways of doing things. Also, I don't quite get why I get an error when using .SD. I thought it is just any easy way to tell data.table that you want all columns. What do I miss?

score 15 · Answer 1 · edited Jan 22 '13 at 21:46

15

Looking at this I realized that the answer was only possible because ID was a unique key (without duplicates). Here is another answer with duplicates. But, by the way, some NA seem to creep in. Could this be a bug? I'm using v1.8.7 (commit 796).

library(data.table)
DT <- data.table(x=c(1,1,1,1,2,2,3),y=c(1,1,2,3,1,1,2))

DT[,rep:=1L][c(2,7),rep:=c(2L,3L)]   # duplicate row 2 and triple row 7
DT[,num:=1:.N]                       # to group each row by itself

DT
   x y rep num
1: 1 1   1   1
2: 1 1   2   2
3: 1 2   1   3
4: 1 3   1   4
5: 2 1   1   5
6: 2 1   1   6
7: 3 2   3   7

DT[,cbind(.SD,dup=1:rep),by="num"]
    num x y rep dup
 1:   1 1 1   1   1
 2:   2 1 1   1  NA      # why these NA?
 3:   2 1 1   2  NA
 4:   3 1 2   1   1
 5:   4 1 3   1   1
 6:   5 2 1   1   1
 7:   6 2 1   1   1
 8:   7 3 2   3   1
 9:   7 3 2   3   2
10:   7 3 2   3   3

Just for completeness, a faster way is to rep the row numbers and then take the subset in one step (no grouping and no use of cbind or .SD) :

DT[rep(num,rep)]
    x y rep num
 1: 1 1   1   1
 2: 1 1   2   2
 3: 1 1   2   2
 4: 1 2   1   3
 5: 1 3   1   4
 6: 2 1   1   5
 7: 2 1   1   6
 8: 3 2   3   7
 9: 3 2   3   7
10: 3 2   3   7

where in this example data the column rep happens to be the same name as the rep() base function.

edited Jan 22 '13 at 21:46

Matt Dowle

58,872
22
166
224

answered Jan 16 '13 at 13:35

statquant

13,672
21
91
162

Thanks. I ran it (v1.8.7) but I don't see the `NA`. Which version do you have? – Matt Dowle Jan 16 '13 at 15:05
Thanks. I still don't see `NA` but now I get two warnings both identical: `In 1:rep : numerical expression has 2 elements: only the first used` – Matt Dowle Jan 16 '13 at 16:02
Try latest (796) as first step then, please, just to rule it out. – Matt Dowle Jan 16 '13 at 16:38
Ok, I'll try again. Let's keep this one on S.O. then rather than datatable-help. Thanks ... – Matt Dowle Jan 16 '13 at 18:27
@MatthewDowle I can reproduce the `NA`s by commenting out the last line or changing the assignment to something other than `DT` in the first block of code. I think `DT <- DT[,cbind(dup=1:rep,.SD),by="num"]` and `DT <- DT[,cbind(.SD,dup=1:rep),by="num"]` are meant to be alternatives, but the first replaces `DT`. – user1935457 Jan 16 '13 at 19:21
Also, in the group with `NA`s, it's worth noting `.SD` and `1:rep` are different lengths. Reproducing the same calculation outside of `[.data.table` gives the same answer for `cbind(1:rep, DT[num==2])` but an error for `cbind(DT[num==2], 1:rep)`: `length of 'dimnames' [1] not equal to array extent`. So my guess is that same error is happening in `dogroups.c`, hence the `NA` for the calculation of `dup` there. – user1935457 Jan 16 '13 at 19:34
I see the NA! Yipee. Thanks @user1935467, yes I was pasting the first part to get the expected result and then pasting the line after that. And, thanks statquant. Right, now to fix ... – Matt Dowle Jan 16 '13 at 20:25
Good! not sure if you saw the end of my first reply, but taking `DT` as the initial example from Christoph_J with the "*alternative*" aka "*buggy*" aka "*`DT[,cbind(.SD,dup=1:rep),by="num"]`*" will fail, it might help you with the diagnostic. – statquant Jan 16 '13 at 21:52
@MatthewDowle See the line containing `x[[i]] = rep(xi, length.out = nr)` in the R function `data.table`. Replacing it with something like `x[[i]] = rbindlist(lapply(seq_len(nr), function(x) xi))` using `fix` gave the expected result for me, though I'm not sure how efficient that is. The other order `cbind(1:rep, .SD)` works because it doesn't trigger `if (inherits(..1, "data.table"))` in `cbind.data.frame`. – user1935457 Jan 16 '13 at 23:02
You deserve a proper name, @user1935457. Thank you. – Matt Dowle Jan 16 '13 at 23:18
@MatthewDowle I can't think of a good one. Beware that I didn't check what happens with that modification in place when there is the possibility that one of the `x[[i]]` with less than `nr` rows is not a `data.table`. – user1935457 Jan 16 '13 at 23:26
@Metthew Dowle: I am afraid the +100 went in my account and not on user1935457... ?! – statquant Jan 18 '13 at 06:20
statquant, @user1935457, [Bug#2478](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2478&group_id=240&atid=975) now fixed. Both the `cbind` root cause and warning added to dogroups. Thanks again to you both. – Matt Dowle Jan 22 '13 at 20:06
@statquant Yes I meant the +100 for you. It was a great solution and you found an important bug too. – Matt Dowle Jan 22 '13 at 20:08

Matt Dowle · Accepted Answer · 2011-12-04T11:31:18.643

Great question. What you tried was very reasonable. Assuming you're using v1.7.1 it's now easier to make list columns. In this case it's trying to make one list column out of .SD (3 items) alongside the MONTH column of the 2nd group (4 items). I'll raise it as a bug [EDIT: now fixed in v1.7.5], thanks.

In the meantime, try :

DT[, cbind(MONTH=startMonth:endMonth, .SD), by="ID"]
 ID MONTH values startMonth endMonth
a_1     1     10          1        3
a_1     2     10          1        3
a_1     3     10          1        3
a_2     3     11          3        6
...

Also, just to check you've seen roll=TRUE? Typically you'd have just one startMonth column (irregular with gaps) and then just roll join to it. Your example data has overlapping month ranges though, so that complicates it.

score 0 · Answer 3 · answered Nov 04 '11 at 14:13

Here is a function I wrote which mimics disaggregate (I needed something that handled complex data). It might be useful for you, if it isn't overkill. To expand only rows, set the argument fact to c(1,12) where 12 would be for 12 'month' rows for each 'year' row.

zexpand<-function(inarray, fact=2, interp=FALSE,  ...)  {
fact<-as.integer(round(fact))
switch(as.character(length(fact)),
        '1' = xfact<-yfact<-fact,
        '2'= {xfact<-fact[1]; yfact<-fact[2]},
        {xfact<-fact[1]; yfact<-fact[2];warning(' fact is too long. First two values used.')})
if (xfact < 1) { stop('fact[1] must be > 0') } 
if (yfact < 1) { stop('fact[2] must be > 0') }
# new nonloop method, seems to work just ducky
bigtmp <- matrix(rep(t(inarray), each=xfact), nrow(inarray), ncol(inarray)*xfact, byr=T)   
#does column expansion
bigx <- t(matrix(rep((bigtmp),each=yfact),ncol(bigtmp),nrow(bigtmp)*yfact,byr=T))
return(invisible(bigx))
}

score -2 · Answer 4 · edited Dec 02 '16 at 06:38

-2

The fastest and most succinct way of doing it:

DT[rep(1:nrow(DT), endMonth - startMonth)]

We can also enumerate by group by:

dd <- DT[rep(1:nrow(DT), endMonth - startMonth)]
dd[, nn := 1:.N, by = ID]
dd

edited Dec 02 '16 at 06:38

Uwe

41,420
11
90
134

answered Dec 01 '16 at 23:52

user7238835

45
6

We can also enumerate by group with: dd=DT[rep(1:nrow(DT),endMonth-startMonth)] – user7238835 Dec 01 '16 at 23:56
We can also enumerate by group by: `code` dd=DT[rep(1:nrow(DT),endMonth-startMonth)] dd[,nn:=1:.N,by=ID] `code` – user7238835 Dec 01 '16 at 23:57
Please edit your answer don't fill the comment section. Use the _edit_ link above. – Marcs Dec 02 '16 at 00:20
2

This doesn't answer the question and even doesn't reproduce the expected result. Your result has only 27 rows while the expected result has 33. The enumeration by group starts always at 1 for each group and thus doesn't return the `MONTH` numbers. This all would have become immediately visible if you would have added some output to your answer. – Uwe Dec 02 '16 at 06:52

Is my way of duplicating rows in data.table efficient?

4 Answers4

Linked