Counting the NA's in a part of a row in data.table

Question

I have a dataset df of which the structure looks similar to the example below:

nr countrycode questionA questionB questionC WeightquestionA WeightquestionB WeightquestionC
1  NLD               2         1         4         0.6             0.2             0.2
2  NLD               NA        4         NA        0.4             0.4             0.2
3  NLD               4         4         1         0.2             0.2             0.6
4  BLG               1         NA        1         0.1             0.5             0.4
5  BLG               5         3         5         0.2             0.2             0.6

The questions A, B and C relate to a similar topic and as a result I would like to create an average score for all questions, taking into account the importance of each question (WeightquestionA WeightquestionB WeightquestionC).

Currently I have manually calculated the average score.

(questionA*WeightquestionA) + (questionB*WeightquestionB) + (questionC*WeightquestionC)

This would not be an insurmountable problem were it not for the NA's (for which: no they cannot be removed). As a result I would like to automate the process.

I am currently thinking of using sum(!is.na()) for counting the non-NA's in each question (A,B,C) for each row (1 through 5) and putting that value into a new column.

With data.table I however always have trouble getting the syntax right. I believe it should be something like:

df[, NonNA:=sum(!is.na(questionA + questionB + questionC))]

But this sums all NA's in the column, instead of for each row. How should I write the syntax to calculate per row?

I would like to refer to the columns separately by name, because they are not next to each other in the actual df.

Desired output:

nr countrycode qA qB qC WeightquestionA WeightquestionB WeightquestionC NonNA
1  NLD         2  1  4         0.6             0.2             0.2      3
2  NLD         NA 4  NA        0.4             0.4             0.2      1
3  NLD         4  4  1         0.2             0.2             0.6      3
4  BLG         1  NA 1         0.1             0.5             0.4      2
5  BLG         5  3  5         0.2             0.2             0.6      3

Related: [R: data.table count !NA per row](https://stackoverflow.com/questions/35306500/r-data-table-count-na-per-row). Select relevant columns with `.SDcols`. — Henrik, Sep 26 '18 at 09:10

Tino · Accepted Answer · 2020-08-06T10:45:38.873

3

Using data.table, you could do this:

df[, NonNA := sum(!is.na(questionA), !is.na(questionB), !is.na(questionC)), by = .(nr)]

A base solution:

df$nonNA <- rowSums(!is.na(df[,c("questionA", "questionB", "questionC")]))

edited Aug 06 '20 at 10:45

answered Sep 26 '18 at 09:06

Tino

2,091
13
15

1

Thanks a lot! My pick for being simple, close to my own approach and providing a base solution. – Tom Sep 26 '18 at 09:21

chinsoon12 · Answer 2 · 2018-09-26T09:11:56.753

2

Another alternative with recommendation from snoram:

df[, NonNA := rowSums(!is.na(.SD)), 
    .SDcols=paste0("question", LETTERS[1:3])]

And also:

df[, NonNA := Reduce(function(x, y) x + !is.na(y), .SD, init=rep(0L, .N)), 
    .SDcols=paste0("question", LETTERS[1:3])]

edited Sep 26 '18 at 09:11

answered Sep 26 '18 at 09:06

chinsoon12

25,005
4
25
35

1

You can drop the as.matrix() – s_baldur Sep 26 '18 at 09:11

Saurabh Chauhan · Answer 3 · 2018-09-26T09:19:50.630

1

We can count non NA (for column questionA,questionB and questionC i.e. column number 3 to 5) using apply as below:

df$nonNA=apply(df[,3:5], 1, function(x) length(which(!is.na(x))))

or (suggestion from snoarm)

df$nonNA=apply(df[,3:5], 1, function(x) sum(!is.na(x)))

Sample output:

   questionA questionB questionC nonNA
1         2         1         4     3
2        NA         4        NA     1
3         4         4         1     3
4         1        NA         1     2
5         5         3         5     3

edited Sep 26 '18 at 09:19

answered Sep 26 '18 at 09:00

Saurabh Chauhan

3,161
2
19
46

1

Can shorten to: `apply(df[,3:5], 1, function(x) sum(!is.na(x)))` – s_baldur Sep 26 '18 at 09:17
@snoram: Thanks for suggestion. I have incorporated your suggestions too :) – Saurabh Chauhan Sep 26 '18 at 09:20
@snoram or: `rowSums(!is.na(df[,3:5]))` – Tino Sep 26 '18 at 09:21

Counting the NA's in a part of a row in data.table

3 Answers3

Linked