12

I have a data.table with columns of different data types. My goal is to select only the numeric columns and replace the NA values within them with 0. I am aware that replacing NA values with zero across a whole table works like this:

DT[is.na(DT)] <- 0

To select only numeric columns, I found this solution, which works fine:

DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

I can achieve what I want by assigning

DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

and then do:

DT2[is.na(DT2)] <- 0

But of course I would like to have my original DT modified by reference. With the following, however:

DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE][
  is.na(DT[, as.numeric(which(sapply(DT, is.numeric))), with = FALSE])] <- 0

I get

"Error in [.data.table([...] i is invalid type (matrix)"

What am I missing? Any help is much appreciated!!

smci
HannesZ
  • You are missing the basic syntax of data.tables, which don't do `DT[...] <- y`. Try reading the vignettes https://github.com/Rdatatable/data.table/wiki/Getting-started It's a more efficient way to learn than "finding solutions" for each step you think you need to take. The answer below doesn't even require the with=FALSE trick you found. – Frank May 23 '16 at 13:09
  • Thanks for the advice. Could you please elaborate on the basic syntax error "...which don't do DT[...] <- y"? What does that mean? Why does the assignment work in one case and not in the other? I could not find anything in the vignettes; it would still help me a lot to understand. – HannesZ May 23 '16 at 13:42
  • Data tables shouldn't be used like `DT[...] <- y` where `...` is whatever you have in mind. Assignment is done with `:=` or `set` not with a `<-`. The arrow way actually does work in special cases, in the sense that the table is modified, but it does not work by reference (last I checked) and so is not idiomatic. To work with data.tables, you'll have to learn some of their idioms. If you don't already know what I mean by `:=`, that's a good reason to check out the vignettes. – Frank May 23 '16 at 14:17
  • a) It will be much more efficient to compute the column-list `numeric_cols <- which(sapply(DT,is.numeric))` once-off at the top, instead of inside each j-expression, for each group. b) Then just reference `DT[, numeric_cols]` c) Yes, putting a function-call inside the j-expression is tricky and often tickles syntax error. – smci Apr 19 '18 at 00:52
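A rough sketch of the idiom described in these comments (the `toy` table and its column names are made up for illustration, not taken from the question): compute the numeric columns once, then update them by reference with `:=` rather than assigning to a subset with `<-`.

library(data.table)
toy <- data.table(num = c(1, NA, 3), chr = c("x", NA, "z"))   # made-up toy data

# compute the list of numeric columns once, outside any j-expression
numeric_cols <- names(toy)[sapply(toy, is.numeric)]

# idiomatic by-reference update with := (no `toy[...] <- y`)
toy[, (numeric_cols) := lapply(.SD, function(v) replace(v, is.na(v), 0)),
    .SDcols = numeric_cols]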

3 Answers

12

We can use `set`:

for(j in seq_along(DT)){
  set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}

Or create an index of the numeric columns, loop through it, and set the NA values to 0:

ind <- which(sapply(DT, is.numeric))
for(j in ind){
    set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}

data

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
akrun
  • What does `set( ..., j = j, ...)` mean? All columns? Surely we only need to do `set()` on the subset of columns that are numeric, as OP asked? – smci Apr 19 '18 at 01:10
  • 1
    @smci Not all columns. In the code I get `ind`, which holds the column indices of the `numeric` columns, so it only loops through those columns – akrun Apr 19 '18 at 02:06
  • Ok. Why can't you avoid looping, by using `ind` to index into `names(DT)` to get a `list` of column-names and pass that as the j-argument of `set()`? I guess the expression to find NAs would then need to be 2D. Well I guess `set()` is already fairly fast. – smci Apr 19 '18 at 02:59
  • @smci I'm not sure I understand your question. The `j` can take either column names or the column index. Here, 'ind' is the index. – akrun Apr 19 '18 at 04:09
  • Why can't you avoid the loop `for(j in ind) { ... set(..., j=j, ...) }` ? Can't you directly do `set(DT, j=ind)` in general? I think you could, but the only reason for the j-loop is that the i-expression to find NA rows for that specific j changes. – smci Apr 19 '18 at 04:56
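A small sketch to illustrate the point discussed in these comments (using the `data` block from the answer, before any NAs have been filled in): `set()` does accept a vector of columns in `j`, but it takes a single `i`, so one call could only touch the same rows in every column. Since the NA positions differ from column to column, `i` has to be recomputed inside the loop.

ind <- which(sapply(DT, is.numeric))

# the rows that need fixing differ from one numeric column to the next,
# which is why each iteration computes its own i before calling set()
lapply(DT[, ind, with = FALSE], function(v) which(is.na(v)))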
4

I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:

library(data.table)

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT

#>    v1   v2         v3
#> 1: NA <NA> -0.5458808
#> 2:  1    A  0.5365853
#> 3:  2    B  0.4196231
#> 4:  3    C -0.5836272
#> 5:  4    D         NA

And the two methods he suggested:

fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}

fun2 <- function(x){
  ind <- which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

I think the first method above is really ingenious, as it exploits the fact that NAs are typed.
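To spell out why fun1 leaves the character column untouched (a small illustration of my own, reusing the DT defined above): for a non-numeric column, is.numeric() returns a single FALSE, the & recycles it across the whole column, and which() then returns integer(0):

j <- 2                                        # v2, the character column in DT above
is.numeric(DT[[j]])                           # FALSE
which(is.na(DT[[j]]) & is.numeric(DT[[j]]))   # integer(0)
# so for non-numeric columns fun1 calls set() with an empty i and changes nothing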

First of all, even though .SD is not available in the i argument, it is possible to pull in the column by name with get(), so I thought I could sub-assign the data.table this way:

fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j):=0]
  }
}

The generic approach, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:

fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms):=lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols=nms]  
}

But then I thought to myself: "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped into setDT():

fun5 <- function(x){
  setDT(
    lapply(x, function(i){
      if(is.numeric(i))
        i[is.na(i)] <- 0
      i
    })
  )
}

Finally, we could use the same conditional idea to limit the columns to which we apply set():

fun6 <- function(x){
  for(j in seq_along(x)){
    if (is.numeric(x[[j]]))
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

Here are the benchmarks:

microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind = fun2(copy(DT)),
  for.get = fun3(copy(DT)),
  for.SDcol = fun4(copy(DT)),
  for.list = fun5(copy(DT)),
  for.set.if = fun6(copy(DT))
)

#> Unit: microseconds
#>           expr     min      lq     mean   median       uq      max neval cld
#>  for.set.2cond  59.812  67.599 131.6392  75.5620 114.6690 4561.597   100 a  
#>    for.set.ind  71.492  79.985 142.2814  87.0640 130.0650 4410.476   100 a  
#>        for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202   100   c
#>      for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491   100  b 
#>       for.list  69.722  81.932 137.2275  87.7720 123.6935 3906.149   100 a  
#>     for.set.if  52.380  58.397 116.1909  65.1215  72.5535 4570.445   100 a  
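One caveat: DT has only five rows here, so these timings mostly reflect per-call overhead (the `[.data.table` dispatch in fun3 and fun4 especially). A sketch for checking how the ranking holds up on a larger table (one million rows, made up along the same lines as the data above):

set.seed(24)
DTbig <- data.table(v1 = sample(c(1:4, NA), 1e6, replace = TRUE),
                    v2 = sample(c(LETTERS[1:4], NA), 1e6, replace = TRUE),
                    v3 = sample(c(rnorm(4), NA), 1e6, replace = TRUE))

microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DTbig)),
  for.SDcol     = fun4(copy(DTbig)),
  for.set.if    = fun6(copy(DTbig)),
  times = 10
)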
dmi3kno
2

You can use the tidyverse purrr function map_if along with ifelse to do the job in a single line of code.

library(tidyverse)
set.seed(24)
DT <- data.table(v1= sample(c(1:3,NA),20,replace = T), v2 = sample(c(LETTERS[1:3],NA),20,replace = T), v3=sample(c(1:3,NA),20,replace = T))

The single line of code below takes a DT with numeric and non-numeric columns and operates only on the numeric columns, replacing the NAs with 0:

DT %>% map_if(is.numeric,~ifelse(is.na(.x),0,.x)) %>% as.data.table

So, tidyverse can be less verbose than data.table sometimes :-)
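It is worth noting, though, that unlike the set()/:= answers this pipeline builds a new object rather than modifying DT by reference, so to keep the result it has to be assigned back, e.g.:

DT <- DT %>% map_if(is.numeric, ~ ifelse(is.na(.x), 0, .x)) %>% as.data.table()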

Lazarus Thurston