When using `data.table`'s `DT[i, j, by]`, is it possible to set the column types beforehand?

Question

I'm trying to calculate the correlation between two variables for multiple different groups (e.g. DT[, cor.test(var1, var2), group]). This works great whenever I use cor.test(var1, var2, method = 'pearson') but throws an error when I use cor.test(var1, var2, method = 'spearman').

library(data.table)
DT <- as.data.table(iris)

# works perfectly 
DT[,cor.test(Sepal.Length,Sepal.Width, method = 'pearson'), Species]
#       Species statistic parameter      p.value  estimate null.value
# 1:     setosa  7.680738        48 6.709843e-10 0.7425467          0
# 2:     setosa  7.680738        48 6.709843e-10 0.7425467          0
# 3: versicolor  4.283887        48 8.771860e-05 0.5259107          0
# 4: versicolor  4.283887        48 8.771860e-05 0.5259107          0
# 5:  virginica  3.561892        48 8.434625e-04 0.4572278          0
# 6:  virginica  3.561892        48 8.434625e-04 0.4572278          0
#    alternative                               method
# 1:   two.sided Pearson's product-moment correlation
# 2:   two.sided Pearson's product-moment correlation
# 3:   two.sided Pearson's product-moment correlation
# 4:   two.sided Pearson's product-moment correlation
# 5:   two.sided Pearson's product-moment correlation
# 6:   two.sided Pearson's product-moment correlation
#                       data.name  conf.int
# 1: Sepal.Length and Sepal.Width 0.5851391
# 2: Sepal.Length and Sepal.Width 0.8460314
# 3: Sepal.Length and Sepal.Width 0.2900175
# 4: Sepal.Length and Sepal.Width 0.7015599
# 5: Sepal.Length and Sepal.Width 0.2049657
# 6: Sepal.Length and Sepal.Width 0.6525292

# error
DT[,cor.test(Sepal.Length,Sepal.Width, method = 'spearman'), Species]
# Error in `[.data.table`(DT, , cor.test(Sepal.Length, Sepal.Width, method = "spearman"), : 
# Column 2 of j's result for the first group is NULL. We rely on the column types of the first 
# result to decide the type expected for the remaining groups (and require consistency). NULL 
# columns are acceptable for later groups (and those are replaced with NA of appropriate type 
# and recycled) but not for the first. Please use a typed empty vector instead, such as 
# integer() or numeric().

Question:

I know there are workarounds for this specific example, but is it possible to tell data.table beforehand what the column types are going to be, for any case using DT[i, j, by = 'something']?

dww · Accepted Answer · 2019-11-21T01:22:42.047

In case you want to keep all columns, rather than remove the ones with a NULL, you can set the class of the 'problem' column manually (in this case, the column causing the issue is "parameter"). This is preferable to removing the NULLs if the column contains values for some groups but not others.

DT[, {
  res <- cor.test(Sepal.Length, Sepal.Width, method = 'spearman')
  class(res$parameter) <- 'integer'
  res
  }, Species]

#      Species statistic parameter      p.value  estimate null.value alternative                          method                    data.name
#1:     setosa  5095.097        NA 2.316710e-10 0.7553375          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
#2: versicolor 10045.855        NA 1.183863e-04 0.5176060          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
#3:  virginica 11942.793        NA 2.010675e-03 0.4265165          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
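
For the more general case, one possible approach is to replace every NULL in the returned list with a typed NA before data.table sees it, so that the first group always fixes a usable column type. The sketch below assumes a hypothetical helper, `fill_null`, which is not part of data.table. `exact = FALSE` is passed only to avoid the ties warning that `cor.test` otherwise raises on `iris`.

```r
library(data.table)

DT <- as.data.table(iris)

# Hypothetical helper (not part of data.table): swap each NULL element
# of a list for a typed NA so every group returns the same columns.
fill_null <- function(x, fill = NA_real_) {
  lapply(x, function(el) if (is.null(el)) fill else el)
}

res <- DT[, fill_null(cor.test(Sepal.Length, Sepal.Width,
                               method = "spearman", exact = FALSE)),
          by = Species]

# 'parameter' is now a numeric column holding NA for every group
str(res$parameter)
```

Using a single `fill` value is a simplification: if a column is integer or character in other groups, a matching fill such as `NA_integer_` or `NA_character_` would be needed for that column.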

You just opened up a whole new world for me! I didn't realize you could have multiple steps in j!! Stupid question.... Does it obey the same rules as an R script? Just trying to soak up as much as I can from you. Thanks for the great example! — Dewey Brooke, Nov 21 '19 at 01:36
sure - everything enclosed inside `{..}` will run just like any other R code. In fact, this isn't just something you can do in the data.table `j`, but it can be done *anywhere* inside *any* R function — dww, Nov 21 '19 at 01:59
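
As a small illustration of that comment, a braced block is itself an expression whose value is the last expression evaluated inside it. This is a base R sketch, and the variable names are made up for the example.

```r
x <- c(1, 2, 3)

# The braces group several statements; the block's value is the value
# of its last expression, here sum(doubled).
total <- {
  doubled <- x * 2
  sum(doubled)
}

# A braced block can also be passed directly as a function argument,
# which is what happens in the j position of DT[i, j, by].
largest <- max({
  shifted <- x - 2
  abs(shifted)
})
```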

score 2 · Answer 2 · answered Nov 21 '19 at 01:04

In my opinion, the error message is actually quite self-explanatory:

Column 2 of j's result for the first group is NULL. We rely on the column types of the first result to decide the type expected for the remaining groups (and require consistency). NULL columns are acceptable for later groups (and those are replaced with NA of appropriate type and recycled) but not for the first. Please use a typed empty vector instead, such as integer() or numeric().

You might want to use `Filter` to remove the NULLs (but be careful that the NULL locations are the same across every `by` group):

DT[, Filter(Negate(is.null), cor.test(Sepal.Length,Sepal.Width, method = 'spearman')), Species]

output:

      Species statistic      p.value  estimate null.value alternative                          method                    data.name
1:     setosa  5095.097 2.316710e-10 0.7553375          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
2: versicolor 10045.855 1.183863e-04 0.5176060          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width
3:  virginica 11942.793 2.010675e-03 0.4265165          0   two.sided Spearman's rank correlation rho Sepal.Length and Sepal.Width

See R: removing NULL elements from a list

Ugh! This works great! Stupid question, however... how is `Filter(Negate(is.null), ...)` actually working? _You can point me to a link if that's a better use of your time._ Also, why can you have `Filter(Negate(is.null), ...)` in the "j" position? I'm sorta bewildered. — Dewey Brooke, Nov 21 '19 at 01:12
`Negate` returns a function that gives the logical opposite of the original function's output. `Filter` extracts the elements of a vector or list that meet a condition. Hence `Filter(Negate(is.null), ...)` removes the NULLs from the list that `cor.test` returns in `j`. — chinsoon12, Nov 21 '19 at 01:18
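
A minimal base R sketch of that explanation, using a hand-built list with illustrative values in place of the real `cor.test` output:

```r
# Stand-in for a cor.test result; the values are for illustration only
res <- list(statistic = 5095.097, parameter = NULL, p.value = 2.3e-10)

# Negate(is.null) builds a new function that returns TRUE for non-NULL input
not_null <- Negate(is.null)

# Filter keeps only the elements for which the predicate returns TRUE,
# so the NULL 'parameter' element is dropped and the names are preserved
kept <- Filter(not_null, res)

names(kept)
```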
