1

I have data on this form in a data.table DT:

DT = data.table(
 year=c('1981', '1981', '1981', '2005', '2005', '2005'),
    value=c(2, 8, 16, 3, 9, 27),
    order =c(1,2,3,1,2,3))
year value order
'1981' 2 1
'1981' 8 2
'1981' 16 3
'2005' 3 1
'2005' 9 2
'2005' 27 3

And I want to create new columns based first on the order within a specific year, but then sequentially on the order if I shift it. As you can see value=16 which starts as order=3 on row 1, is logged as order = 2 on row 2, etc.

year order1 order2 order3
'1981' 2 8 16
'1981' 8 16 NA
'1981' 16 NA NA
'2005' 3 9 27
'2005' 9 27 NA
'2005' 27 NA NA

If I wanted it just by order, and get rows 1 and 4 as output, I could do:

dcast(DT, year ~ order, value.var = c('value'))

But how can I cast based on order while incorporating this reordering?

I could perhaps create new columns indicating the new shifted order, using:

DT[,order_2:= c(NA,1,2,NA,1,2)]
DT[,order_3:= c(NA,NA,1,NA,NA,1)]

But then how do I do casting on all three columns? Is there a more elegant way than just casting 3 times and then joining the results?

Hal
  • 75
  • 6

4 Answers4

4

You don't necessarily need dcast, try this:

DT[, lapply(seq_along(value), \(v) {l <- length(value); `length<-`(value[v:l], l)}), by=year]
#    year V1 V2 V3
# 1: 1981  2  8 16
# 2: 1981  8 16 NA
# 3: 1981 16 NA NA
# 4: 2005  3  9 27
# 5: 2005  9 27 NA
# 6: 2005 27 NA NA
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • 1
    nice concise solution : upvoted ;-). However, for a general solution perhaps first need to make sure DT is ordered by `year,order` (which happens to be the case here) – Waldi Dec 03 '22 at 11:40
  • Great, this seems to work for the base case where I have the same number of columns for each group, but not for cases with fewer ordered observations. What is the meaning of the \(), `length<-` and {} here? – Hal Dec 03 '22 at 13:44
  • @Hal You probably wonder about the `\(v)`, in new R versions its shorthand for `function(v)`, change this or better update your R. The others are all functions and documented: `?\`length<-\``, `?\`{\``. – jay.sf Dec 03 '22 at 13:53
1

We could use shift with transpose

library(data.table)
DT[, setNames(transpose(shift(value, n = seq_len(.N)-1, type = "lead")), 
      paste0("order", order)), year]

-output

   year order1 order2 order3
1: 1981      2      8     16
2: 1981      8     16     NA
3: 1981     16     NA     NA
4: 2005      3      9     27
5: 2005      9     27     NA
6: 2005     27     NA     NA
akrun
  • 874,273
  • 37
  • 540
  • 662
1

Here's a dcast, a little bit of other work to get it right:

library(data.table)
dcast(year + order ~ paste0("order", ord2),
      data = DT[DT, on = .(year == year, order <= order)][, ord2 := seq(.N), by = .(year, order)],
      value.var = "i.value"
      )[, order := NULL
       ][, lapply(.SD, sort, na.last = TRUE), .SDcols = patterns("^order"), by = .(year)][]
dcast(year + order ~ paste0("order", ord2), data = DT[DT, on = .(year == year, order <= order)][, ord2 := seq(.N), by = .(year, order)], value.var = "i.value")[, order := NULL][, lapply(.SD, sort, na.last = TRUE), .SDcols = patterns("^order"), by = .(year)][]
#      year order1 order2 order3
#    <char>  <num>  <num>  <num>
# 1:   1981      2      8     16
# 2:   1981      8     16     NA
# 3:   1981     16     NA     NA
# 4:   2005      3      9     27
# 5:   2005      9     27     NA
# 6:   2005     27     NA     NA

Caveat: with this sample data, it is significantly slower than the other two answers. It is offered primarily to show that while it may be possible, it is not always the most efficient path:

bench::mark(
  jay.sf  = DT[, lapply(seq_along(value), \(v) {l <- length(value); `length<-`(value[v:l], l)}), by=year],
  akrun   = DT[, setNames(transpose(shift(value, n = seq_len(.N)-1, type = "lead")), paste0("order", order)), year],
  r2evans = dcast(year + order ~ paste0("order", ord2), data = DT[DT, on = .(year == year, order <= order)][, ord2 := seq(.N), by = .(year, order)], value.var = "i.value")[, order := NULL][, lapply(.SD, sort, na.last = TRUE), .SDcols = patterns("^order"), by = .(year)][],
  check = FALSE)
# # A tibble: 3 x 13
#   expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory              time               gc                  
#   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>              <list>             <list>              
# 1 jay.sf      225.2us 266.15us     3642.    48.6KB     5.38  1354     2      372ms <NULL> <Rprofmem [11 x 3]> <bench_tm [1,356]> <tibble [1,356 x 3]>
# 2 akrun       262.9us  307.4us     3149.    48.6KB     2.41  1305     1      414ms <NULL> <Rprofmem [12 x 3]> <bench_tm [1,306]> <tibble [1,306 x 3]>
# 3 r2evans      2.74ms   2.99ms      330.   453.3KB     2.30   143     1      434ms <NULL> <Rprofmem [91 x 3]> <bench_tm [144]>   <tibble [144 x 3]>  
r2evans
  • 141,215
  • 6
  • 77
  • 149
0

using shift(value, 0:-2) makes this a one liner:

setnames(DT[, shift(value,(-0:-2)),year], c("year", paste0("order",1:3)))[]

Output:

   year order1 order2 order3
1: 1981      2      8     16
2: 1981      8     16     NA
3: 1981     16     NA     NA
4: 2005      3      9     27
5: 2005      9     27     NA
6: 2005     27     NA     NA
langtang
  • 22,248
  • 1
  • 12
  • 27