
Working with data.table in R, I am trying to combine two columns and create a new column that holds the unique values of each row. In the example below the code works fine with table z1, but with z2 I get an error. Both tables were created in the same way; their columns hold different information, but that should not cause the same code to fail on z2.

Thank you so much for your help, and please let me know if I am not clear.

Best,

library(data.table)


z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"), 
                 b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))


z2 <- data.table(a = c("ARG_1980_EPH_D2_INC_GROUP", "ARG_1980_EPH_D2_INC_GROUP"), 
                 b = c("ARG_1986_EPH_D2_INC_HIST", "ARG_1986_EPH_D2_INC_HIST"))


z1[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]

z1[]
#>                             a                          b
#> 1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP
#>                                                 cache_id
#> 1:                            ARE_2014_HIES_D1_INC_GROUP
#> 2: ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP


z2[,
   cache_id := as.list(apply(.SD, 1, unique)),
   .SDcols = c("a", "b")
]
#> Error in `[.data.table`(z2, , `:=`(cache_id, as.list(apply(.SD, 1, unique))), : Supplied 4 items to be assigned to 2 items of column 'cache_id'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
z2[]
#>                            a                        b
#> 1: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST
#> 2: ARG_1980_EPH_D2_INC_GROUP ARG_1986_EPH_D2_INC_HIST

Created on 2023-06-12 with reprex v2.0.2

zx8754
  • Never, ever, use `apply` when working with a data.table. – Roland Jun 13 '23 at 08:41
  • @Roland I don't think that's good advice though. See another [answer](https://stackoverflow.com/a/48906142/9454926) which uses the `apply`/`parApply` functions to leverage parallel processing. The `apply` approach is quite readable too. – Hieu Nguyen Jun 13 '23 at 17:42
  • @HieuNguyen I disagree. a) data.table is already parallelized internally. b) If you are iterating over rows, you are usually not using an efficient approach. – Roland Jun 14 '23 at 05:11
  • @Roland There is a small set of functions that are GForce-optimized; however, custom functions like the one in the post I linked above still benefit from explicit parallelization. The OP in that post commented that it sped things up a lot for them. – Hieu Nguyen Jun 14 '23 at 06:17
  • @HieuNguyen If you really must iterate over rows (you shouldn't need to), you should write an Rcpp function. Explicit parallelization is only worth it if your function is complex and time-consuming, and in that case I would wonder why you are using data.table. – Roland Jun 14 '23 at 06:25
  • @Roland, do you think a better approach is to reshape and then get the unique values by group? Or is a better approach found outside data.table? I use data.table as my main data frame syntax, assuming it is, most of the time, the fastest approach. However, I am always on the lookout for more efficient approaches. Thanks. – R.Andres Castaneda Jun 14 '23 at 15:51
  • It depends on the specifics, but with your example I would first question why you are even doing it. I would melt the data.table and use data.table::unique; no loop over rows is needed. – Roland Jun 14 '23 at 17:24

5 Answers


apply returns a matrix if every per-row result has the same length, otherwise a list:

apply(z1[,.(a ,b)], 1, unique)

[[1]]
[1] "ARE_2014_HIES_D1_INC_GROUP"

[[2]]
[1] "ARE_2014_HIES_D1_INC_GROUP" "ARE_2015_HIES_D1_INC_GROUP"

apply(z2[, .(a, b)], 1, unique)

     [,1]                        [,2]                       
[1,] "ARG_1980_EPH_D2_INC_GROUP" "ARG_1980_EPH_D2_INC_GROUP"
[2,] "ARG_1986_EPH_D2_INC_HIST"  "ARG_1986_EPH_D2_INC_HIST" 

Also, as.list on a matrix does not give you a column-wise list; instead every matrix element becomes its own list element:

as.list(apply(z2[, .(a, b)], 1, unique))

[[1]]
[1] "ARG_1980_EPH_D2_INC_GROUP"

[[2]]
[1] "ARG_1986_EPH_D2_INC_HIST"

[[3]]
[1] "ARG_1980_EPH_D2_INC_GROUP"

[[4]]
[1] "ARG_1986_EPH_D2_INC_HIST"

hence the error about the length: four items were supplied for a two-row table.

I'm not exactly sure what your end result should be, so I can't provide a definitive answer.

det
  • Using `simplify = FALSE` in `apply` makes the result consistently a list. Then the answer for the OP is simply `z2[, cache_id := apply(.SD, 1, unique, simplify = FALSE), .SDcols = c("a", "b")]` – Hieu Nguyen Jun 13 '23 at 17:26
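
As a sketch of the fix suggested in the comment above: the `simplify` argument of `apply()` (available since R 4.1.0) forces a list result with one element per row, so the assignment supplies exactly `nrow(z2)` items.

z2[,
   cache_id := apply(.SD, 1, unique, simplify = FALSE),
   .SDcols = c("a", "b")
]
# each element of cache_id is now a character vector holding that row's distinct values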

You can try the following approach:

z1[, cache_id := list(.(unique(c(a, b)))), 1:nrow(z1)]

and similarly for z2

Output:

                            a                          b                                              cache_id
                       <char>                     <char>                                                <list>
1: ARE_2014_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP                            ARE_2014_HIES_D1_INC_GROUP
2: ARE_2014_HIES_D1_INC_GROUP ARE_2015_HIES_D1_INC_GROUP ARE_2014_HIES_D1_INC_GROUP,ARE_2015_HIES_D1_INC_GROUP
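
For z2, the same pattern would be (a sketch; only the table changes, and grouping by 1:nrow() evaluates the j expression once per row):

z2[, cache_id := list(.(unique(c(a, b)))), 1:nrow(z2)]
# every row of z2 has distinct a and b values, so each cache_id element holds both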
langtang

Another approach, which doesn't need to iterate over rows like @langtang's answer does:

z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")
   ][, cache_id := sapply(cache_id, paste, collapse = ", ")]
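
For readers unfamiliar with `.mapply`: `.mapply(c, .SD, NULL)` combines the columns element-wise, giving a list with one element per row, `lapply(..., unique)` deduplicates each element, and the second chained call collapses every element into a comma-separated string. A minimal sketch of the intermediate steps on the question's z1:

row_vals <- .mapply(c, z1[, .(a, b)], NULL)  # one list element per row, holding that row's a and b values
lapply(row_vals, unique)                     # per-row unique values, before collapsing to a string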
Hieu Nguyen

It is unclear why you are doing this, and I suspect we are dealing with an XY problem here. You should almost never need to iterate over the rows of a data.table; usually that's a design issue. If you really need to do it, turn to Rcpp if it isn't a one-off or if your data.table is actually large.

Anyway, in the specific example, you can use data.table::unique:

library(data.table)
z1 <- data.table(a = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2014_HIES_D1_INC_GROUP"), 
                b = c("ARE_2014_HIES_D1_INC_GROUP", "ARE_2015_HIES_D1_INC_GROUP"))
z1[, rn := .I]
unique(melt(z1, "rn"), by = c("rn", "value"))
#   rn variable                      value
#1:  1        a ARE_2014_HIES_D1_INC_GROUP
#2:  2        a ARE_2014_HIES_D1_INC_GROUP
#3:  2        b ARE_2015_HIES_D1_INC_GROUP

If you must, you can then split the value column by rn and add it to the data.table. But again, why would you need that?
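
A sketch of that optional last step, keeping the cache_id name from the question and assuming rn was added as .I so it matches the row order:

u <- unique(melt(z1, "rn"), by = c("rn", "value"))
z1[, cache_id := split(u$value, u$rn)]  # list column: one vector of unique values per row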

Roland

Thank you all for your answers; they really helped me out and I learned more about apply and data.table. I selected @langtang's answer because it is the fastest. Still, thank you so much @hieu-nguyen for both solutions. I think `simplify = FALSE` was the key to the problem, but you made that point in a comment, which I can't accept as the answer. Please find the benchmark below.

library(data.table)


n <- 1e4
x <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))
y <- sapply(1:n, \(x) sample(letters, 10) |> paste(collapse = ""))

ni <- sample(1:n, floor(n/10), replace = FALSE)

x[ni] <- y[ni]



z1 <- data.table(a = x, 
                 b = y)

bench <- microbenchmark::microbenchmark(
  times = 30,
  simplify = z1[,
                cache_id := as.list(apply(.SD, 1, unique, simplify = FALSE)),
                .SDcols = c("a", "b")],
  loop_rows = z1[, cache_id:=list(.(unique(c(a,b)))), 1:nrow(z1)], 
  mapply    = z1[, cache_id := lapply(.mapply(c, .SD, NULL), unique), .SDcols = c("a", "b")]
)

bench
#> Unit: milliseconds
#>       expr       min        lq     mean   median       uq      max neval cld
#>   simplify 145.03549 171.67624 209.7165 214.4948 244.1717 268.3255    30 a  
#>  loop_rows  80.62317  98.74864 110.2403 106.6774 122.0016 148.4702    30  b 
#>     mapply 337.39212 409.21162 482.0041 478.9344 544.5397 765.9302    30   c

Created on 2023-06-13 with reprex v2.0.2