How can I reshape my data.table efficiently

Question

I have a data.table DT

set.seed(1)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))
DT
    x y   v
 1: 1 A  29
 2: 1 B  92
 3: 1 A 100
 4: 1 B  82
 5: 2 A  28
 6: 2 B  26
 7: 2 A  18
 8: 2 B  22
 9: 3 A  30
10: 3 B  96
11: 3 A  15
12: 3 B   4

I would like to expand it like bellow, creating a new column for each value of x and reporting the v values, no structure in the data should be expected (not by blocks like bellow)

    x y v.1  v.2 v.3
 1: 1 A  29   NA  NA
 2: 1 B  92   NA  NA
 3: 1 A 100   NA  NA
 4: 1 B  82   NA  NA
 5: 2 A  NA   28  NA
 6: 2 B  NA   26  NA
 7: 2 A  NA   18  NA
 8: 2 B  NA   22  NA
 9: 3 A  NA   NA  30
10: 3 B  NA   NA  96
11: 3 A  NA   NA  15
12: 3 B  NA   NA   4

I asked a very similar question here but cannot adapt the answer G Grothendieck gave us at the time...

EDIT: As usual I just almost got it after I wrote the post... I just need to replace those 0 by NA (I might get 0 in v and I want to be able to dissociate v==0 from missing items)

DT2 <- DT[, {SUM.<-factor(x); data.table(model.matrix(~ SUM.:v + 0))}]
txtR) DT2
    SUM.1:v SUM.2:v SUM.3:v
 1:      29       0       0
 2:      92       0       0
 3:     100       0       0
 4:      82       0       0
 5:       0      28       0
 6:       0      26       0
 7:       0      18       0
 8:       0      22       0
 9:       0       0      30
10:       0       0      96
11:       0       0      15
12:       0       0       4

Frank · Accepted Answer · 2013-05-17T15:42:09.833

4

set.seed(1)
DT <- data.table(x=rep(c(1,2,3),each=4), y=c("A","B"), v=sample(1:100,12))

This gives

Next, the answer:

ux <- unique(DT$x)
DT[,c(v.=lapply(ux,function(i)v[x==i])),by="x,y"]

which gives

    x y v.1 v.2 v.3
 1: 1 A  27  NA  NA
 2: 1 A  57  NA  NA
 3: 1 B  37  NA  NA
 4: 1 B  89  NA  NA
 5: 2 A  NA  20  NA
 6: 2 A  NA  97  NA
 7: 2 B  NA  86  NA
 8: 2 B  NA  62  NA
 9: 3 A  NA  NA  58
10: 3 A  NA  NA  19
11: 3 B  NA  NA   6
12: 3 B  NA  NA  16

That answer might break in later versions of R, but the OP pointed out that this works too and may be faster:

DT[,paste0("v.",ux):=lapply(ux,function(i)v[x==i]),by="x"]

edited May 17 '13 at 15:42

answered May 17 '13 at 15:11

Frank

66,179
8
96
180

I think it's sufficient for `by=x`? – Arun May 17 '13 at 15:15
1

Hmm, the desired output has a "y" column anyway. Without putting it in the `by`, would I have to do something like a merge or `DT[,c(list(y=y),v.=lapply(1:3,function(i)v[x==i])),by="x"]`? – Frank May 17 '13 at 15:18
@Frank, I think so. Your current answer with `by="x,y"` won't be identical to OP's requirement. Look carefully at the values you get for your columns and compare with mine.. (except the NA, which I still don't know how to replace while creating model.matrix). – Arun May 17 '13 at 15:20
@statquant Okay, you're right. I've changed it to go over all x values. – Frank May 17 '13 at 15:22
@Arun I still don't see it. Yours has a "v" column; 0s instead of `NA`s; and rows in a different order, but otherwise, they look the same to me... By the way, I didn't read in the OP's data, but re-ran the generating code with a seed. – Frank May 17 '13 at 15:26
1

`DT[,paste0("v.",ux):=lapply(ux,function(i)v[x==i]),by="x"]` works better (and is probably quicker too) – statquant May 17 '13 at 15:30
I'm not seeing warnings with ... R version 2.15.2 (2012-10-26); Platform: x86_64-w64-mingw32/x64 (64-bit); data.table_1.8.6 – Frank May 17 '13 at 15:30
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/30139/discussion-between-frank-and-statquant) – Frank May 17 '13 at 15:31
@statquant, yes I get the warning too on R 3.0.0 – Arun May 17 '13 at 15:33

Arun · Answer 2 · 2013-05-17T15:30:16.820

3

Here's one way:

tt <- model.matrix(data=DT, ~ factor(x):rep(1, nrow(DT)) + 0)
tt[tt==0] <- NA
cbind(DT, DT$v * tt)
#     x y   v factor(x)1:v factor(x)2:v factor(x)3:v
#  1: 1 A  69           69           NA           NA
#  2: 1 B  39           39           NA           NA
#  3: 1 A  76           76           NA           NA
#  4: 1 B  49           49           NA           NA
#  5: 2 A 100           NA          100           NA
#  6: 2 B  95           NA           95           NA
#  7: 2 A  36           NA           36           NA
#  8: 2 B  73           NA           73           NA
#  9: 3 A  86           NA           NA           86
# 10: 3 B  20           NA           NA           20
# 11: 3 A  59           NA           NA           59
# 12: 3 B  12           NA           NA           12

edited May 17 '13 at 15:30

answered May 17 '13 at 15:12

Arun

116,683
26
284
387

do you see a trick to replace those 0s by NA it's too bad cause it's lightnin' quick... – statquant May 17 '13 at 15:25
@statquant, not straightforward, but I've managed.. check the edit please. – Arun May 17 '13 at 15:30
@statquant, You mean efficient in terms of runtime or lines of code (or both)? – Arun May 17 '13 at 17:05

eddi · Answer 3 · 2013-05-17T17:02:32.563

3

You could simply loop over the x's and use data.table assignment:

setkey(DT, x)
for (i in unique(DT$x)) {
  DT[J(i), paste0("v.", i) := v]
}

P.S. I really wish the following worked, but .GRP is not available there:

DT[, paste0("v.", .GRP) := v, by = x]

edit one more solution (trying to get the above .GRP idea to work somehow), using rbind.fill (I didn't run very careful benches, but this seemed to scale quite well)

library(plyr)

cbind(DT,
      rbind.fill(DT[, list(list(setnames(data.table(v), paste0("v.", .GRP)))),
                      by = x]$V1))

edited May 17 '13 at 17:02

answered May 17 '13 at 15:48

eddi

49,088
6
104
155

2

+1. Re: the second idea, that sounds like a good feature request, eh? – Frank May 17 '13 at 15:52
If the feature enabling `DT[, paste0("v.", .GRP) := v, by = x]` were requested it would be nice if it included a facility to specify the fill value in case some other value than `NA` were desired. – G. Grothendieck May 17 '13 at 16:54
@Frank, I second this. But I think this has been raised sometime before.. iirc. And I think this would be the fastest by far, especially as the columns are already allocated in data.table.. – Arun May 17 '13 at 17:14

How can I reshape my data.table efficiently

3 Answers3