
I am trying to count the number of columns that do not contain NA for each row, and place that value into a new column for that row.

Example data:

library(data.table)

a = c(1,2,3,4,NA)
b = c(6,NA,8,9,10)
c = c(11,12,NA,14,15)
d = data.table(a,b,c)

> d 
    a  b  c
1:  1  6 11
2:  2 NA 12
3:  3  8 NA
4:  4  9 14
5: NA 10 15

My desired output would include a new column num_obs which contains the number of non-NA entries per row:

    a  b  c num_obs
1:  1  6 11       3
2:  2 NA 12       2
3:  3  8 NA       2
4:  4  9 14       3
5: NA 10 15       2

I've been reading for hours now, and so far the best I've come up with is looping over rows, which I know is rarely advisable in R or data.table. I'm sure there is a better way to do this; please enlighten me.

My crappy way:

len = 1:NROW(d)
for (n in len) {
  # count the non-NA entries in row n and assign them by reference
  d[n, num_obs := length(which(!is.na(d[n])))]
}
Reilstein
  • Like `d[, num_obs := sum(!is.na(.SD)), by = 1:nrow(d)][]` or `d[, num_obs := rowSums(!is.na(d))][]`? (Not sure which would be faster.) – A5C1D2H2I1M1N2O1R2T1 Feb 10 '16 at 03:56
  • Yes! This worked. Could you explain it? I thought .SD would equal the entire dataset if I used `by = 1:nrow(d)`. How is it that this does it by row? Edit: also, what exactly does adding the empty chain `[]` do? – Reilstein Feb 10 '16 at 04:00

2 Answers


Try this one, using Reduce to chain together calls to `+`:

d[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))]
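
For the three example columns, this Reduce call just adds the logical vectors element-wise, with each TRUE (a non-NA entry) counting as 1. As a minimal sketch of the expansion, using the a, b and c vectors from the question's setup:

(!is.na(a)) + (!is.na(b)) + (!is.na(c))
# [1] 3 2 2 3 2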

If speed is critical, you can eke out a touch more with Ananda's suggestion to hardcode the number of columns being assessed:

d[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))]
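
The hardcoded 4 is tied to the number of columns being assessed (the OP mentions four columns in their real data). As a hedged alternative, assuming num_obs has not yet been added to d, the column count can be computed once up front instead:

# sketch: compute the column count once so nothing is hardcoded or redone per row
nc <- ncol(d)
d[, num_obs := nc - Reduce("+", lapply(.SD, is.na))]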

Benchmarking using Ananda's larger data.table d (from the other answer below):

fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun3 <- function(indt) indt[, num_obs := Reduce(`+`, lapply(.SD,function(x) !is.na(x)))][]
fun4 <- function(indt) indt[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))][]

library(microbenchmark)
microbenchmark(fun1(copy(d)), fun3(copy(d)), fun4(copy(d)), times=10L)

#Unit: milliseconds
#          expr      min       lq     mean   median       uq      max neval
# fun1(copy(d)) 3.565866 3.639361 3.912554 3.703091 4.023724 4.596130    10
# fun3(copy(d)) 2.543878 2.611745 2.973861 2.664550 3.657239 4.011475    10
# fun4(copy(d)) 2.265786 2.293927 2.798597 2.345242 3.385437 4.128339    10
thelatemail
  • Nice option. Plus one. – akrun Feb 10 '16 at 05:29
  • Maybe this will eke out some more speed given the OP's description: `indt[, num_obs := 4 - Reduce("+", lapply(.SD, is.na))][]`. I've hard-coded the "4" to avoid recalculation by row. – A5C1D2H2I1M1N2O1R2T1 Feb 10 '16 at 05:48
  • @AnandaMahto - it shaved a bit more time off - I've updated the answer above. – thelatemail Feb 10 '16 at 05:52
  • @thelatemail - I'm sorry if this is a dumb question, but could you tell me if I have this correct? `.SD` represents the current subset of data, and if no argument is given in the `i` position in data.table, then `.SD` equals the data for all columns for the _current row_, NOT _all rows_ – Reilstein Feb 10 '16 at 19:28
  • @Reilstein - not a dumb question at all. I think `.SD` actually represents the whole data.table if there is nothing in `i` - e.g. `d[,dim(.SD)]` returns a count of both rows and columns. – thelatemail Feb 10 '16 at 22:02
  • @thelatemail - I see, so in that case, in your solution, the `Reduce('+', lapply(.SD,function(x) !is.na(x)))` statement is saying `+` is the function, and `lapply(.SD,function(x) !is.na(x)))` is turning the rows given by `.SD` into a list to be operated on? Do I have that right? I was looking at the `Reduce()` help, and I gathered that it accepts a function, a vector, and an R object, so I am trying to piece together which is which in your solution. Thanks again. – Reilstein Feb 11 '16 at 02:43
  • @Reilstein - ignoring the `NA` checking temporarily for explanation's sake, `Reduce` is taking the parts of the list and doing `(((a + b) + c) + d)` or essentially `a + b + c + d`. Since `+` is vectorised, each part (row) of each vector is added together. Here's another good example of `Reduce` that shows how it can be employed - http://stackoverflow.com/questions/22624284/r-intersecting-strings/22624311 – thelatemail Feb 11 '16 at 02:57
  • Okay great, thanks. I'm beginning to wrap my head around it. – Reilstein Feb 11 '16 at 03:06

The two options that quickly come to mind are:

d[, num_obs := sum(!is.na(.SD)), by = 1:nrow(d)][]
d[, num_obs := rowSums(!is.na(d))][]

The first works by creating a "group" of just one row per group (1:nrow(d)). Without that, it would just count the non-NA values across the entire table and recycle that single total for every row.
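
To see that each "group" really is a single row, a quick sketch on the example data:

# every group produced by by = 1:nrow(d) contains exactly one row
d[, .N, by = 1:nrow(d)][, unique(N)]
# [1] 1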

The second makes use of an already very efficient base R function, rowSums.
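
It can do this because !is.na() on a data.table returns a logical matrix, which rowSums then totals row by row in optimized C code. A quick sketch on the example data (before num_obs is added):

rowSums(!is.na(d))  # one count of non-NA entries per row: 3 2 2 3 2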

Here is a benchmark on larger data:

set.seed(1)
nrow = 10000
ncol = 15
d <- as.data.table(matrix(sample(c(NA, -5:10), nrow*ncol, TRUE), nrow = nrow, ncol = ncol))

fun1 <- function(indt) indt[, num_obs := rowSums(!is.na(indt))][]
fun2 <- function(indt) indt[, num_obs := sum(!is.na(.SD)), by = 1:nrow(indt)][]

library(microbenchmark)
microbenchmark(fun1(copy(d)), fun2(copy(d)))
# Unit: milliseconds
#           expr        min         lq       mean     median         uq      max neval
#  fun1(copy(d))   3.727958   3.906458   5.507632   4.159704   4.475201 106.5708   100
#  fun2(copy(d)) 584.499120 655.634889 684.889614 681.054752 712.428684 861.1650   100

By the way, the empty [] is just there to print the resulting data.table. It is needed whenever you want the output printed, because := and the set* functions in "data.table" return their result invisibly.
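
A quick illustration (a sketch; both lines assign the column by reference):

d[, num_obs := rowSums(!is.na(d))]    # assigns the column, prints nothing
d[, num_obs := rowSums(!is.na(d))][]  # the trailing [] makes the result print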

A5C1D2H2I1M1N2O1R2T1
  • This is great, thanks @AnandaMahto ! I am a little bit confused about how you created a group of one row using `1:nrow(d)`. My initial thought was that this statement would create a group from row 1 to the final row of the file. I'm clearly missing how the `by` statement works or something. – Reilstein Feb 10 '16 at 04:21
  • My benchmarks suggest this is quicker again: `fun3 <- function(indt) indt[, num_obs := Reduce(`+`, lapply(.SD, function(x) !is.na(x)))][]` – thelatemail Feb 10 '16 at 04:28
  • Yeah on my larger data sets, about 115s vs. 0.06s, so a substantial improvement! Thanks for adding this! – Reilstein Feb 10 '16 at 04:39
  • @Reilstein, out of curiosity, what are the dimensions of your larger dataset? – A5C1D2H2I1M1N2O1R2T1 Feb 10 '16 at 04:56
  • the "larger" dataset I was using for some benchmarking was 4 columns, 1.5 million lines. I eventually will need to run this on something with ~20 columns, 6 million lines, so the speed improvements are a big help :). – Reilstein Feb 10 '16 at 05:35
  • @Reilstein, I would suggest you transfer the check mark so that it draws attention to thelatemail's answer. – A5C1D2H2I1M1N2O1R2T1 Feb 10 '16 at 05:54
  • @Ananda sure I can do that. Thanks again for your help. – Reilstein Feb 10 '16 at 06:24
  • Here's a version when you need to specify columns: `d[,num_obs := rowSums(!is.na(.SD), na.rm=T), .SDcols = c('a','b')]` – kakarot Jan 29 '21 at 13:57