R: remove multiple rows based on missing values in fewer rows

Question

I have an R data frame with data from multiple subjects, each tested several times. To perform statistics on the set, there is a factor for subject ("id") and a row for each observation (given by factor "session"). I.e.

print(allData)
id     session     measure
1      1           7.6
2      1           4.5
3      1           5.5
1      2           7.1
2      2           NA
3      2           4.9

In the above example, is there a simple way to remove all rows with id==2, given that the "measure" column contains NA in one of the rows where id==2?

More generally, since I actually have a lot of measures (columns) and four sessions (rows) for each subject, is there an elegant way to remove all rows with a given level of the "id" factor, given that (at least) one of the rows with this "id"-level contains NA in a column?

I suspect there is a built-in function that would solve this problem more elegantly than my current solution:

# Which columns to check for NA's in
probeColumns = c('measure1','measure4') # Etc...

# A vector of all "id" levels that appear in at least one row with NA in the probeColumns
idsWithNAs = allData[!complete.cases(allData[probeColumns]),"id"]

# All rows whose id is not in idsWithNAs
cleanedData = allData[!allData$id %in% idsWithNAs,]
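
For reference, here is a minimal, self-contained sketch of the two-step approach above, run on the six-row example from the top of the question. It assumes the only column to check is called `measure` (as in the printout) rather than `measure1` or `measure4`.

```r
# Toy data copied from the printout in the question
allData <- data.frame(
  id      = c(1, 2, 3, 1, 2, 3),
  session = c(1, 1, 1, 2, 2, 2),
  measure = c(7.6, 4.5, 5.5, 7.1, NA, 4.9)
)

# Assumption: a single probe column named "measure"
probeColumns <- "measure"

# Ids that have NA in at least one probe column, in at least one row
idsWithNAs <- unique(allData[!complete.cases(allData[probeColumns]), "id"])

# Keep only rows whose id is not in idsWithNAs
cleanedData <- allData[!allData$id %in% idsWithNAs, ]
print(cleanedData)
```

On this data only id 2 has an NA, so both of its rows are dropped and the four rows for ids 1 and 3 remain.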

Thanks, /Jonas

There's probably a way to do it with `sqldf`, but I don't think it would be fundamentally more simple. — IRTFM, Mar 28 '12 at 12:14

score 3 · Accepted Answer · answered Mar 29 '12 at 00:18

You can use the ddply function from the plyr package to 1) subset your data by id, 2) apply a function that will return NULL if the sub data.frame contains NA in the columns of your choice, or the data.frame itself otherwise, and 3) concatenate everything back into a data.frame.

allData <- data.frame(id       = rep(1:4, 3),
                      session  = rep(1:3, each = 4),
                      measure1 = sample(c(NA, 1:11)),
                      measure2 = sample(c(NA, 1:11)),
                      measure3 = sample(c(NA, 1:11)),
                      measure4 = sample(c(NA, 1:11)))
allData  # values come from sample(), so yours will differ from the printout below
#    id session measure1 measure2 measure3 measure4
# 1   1       1        3        7       10        6
# 2   2       1        4        4        9        9
# 3   3       1        6        6        7       10
# 4   4       1        1        5        2        3
# 5   1       2       NA       NA        5       11
# 6   2       2        7       10        6        5
# 7   3       2        9        8        4        2
# 8   4       2        2        9        1        7
# 9   1       3        5        1        3        8
# 10  2       3        8        3        8        1
# 11  3       3       11       11       11        4
# 12  4       3       10        2       NA       NA

# Which columns to check for NA's in
probeColumns = c('measure1','measure4')

library(plyr)
ddply(allData, "id",
      function(df) if (any(is.na(df[, probeColumns]))) NULL else df)
#   id session measure1 measure2 measure3 measure4
# 1  2       1        4        4        9        9
# 2  2       2        7       10        6        5
# 3  2       3        8        3        8        1
# 4  3       1        6        6        7       10
# 5  3       2        9        8        4        2
# 6  3       3       11       11       11        4
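
If plyr is not available, the same per-id filtering can probably be sketched in base R with `ave()`. This is a different technique from the `ddply` call above, not a reimplementation of it. The data below is typed in by hand from the printout so the example is reproducible; the helper names `rowHasNA` and `idHasNA` are made up for illustration.

```r
# Fixed copy of the printed example data (no sample(), so results are stable)
allData <- data.frame(
  id       = rep(1:4, 3),
  session  = rep(1:3, each = 4),
  measure1 = c(3, 4, 6, 1, NA, 7, 9, 2, 5, 8, 11, 10),
  measure2 = c(7, 4, 6, 5, NA, 10, 8, 9, 1, 3, 11, 2),
  measure3 = c(10, 9, 7, 2, 5, 6, 4, 1, 3, 8, 11, NA),
  measure4 = c(6, 9, 10, 3, 11, 5, 2, 7, 8, 1, 4, NA)
)
probeColumns <- c("measure1", "measure4")

# TRUE for each row that has NA in any probe column
rowHasNA <- !complete.cases(allData[, probeColumns, drop = FALSE])

# ave() takes the max of the flag within each id and repeats it on every
# row of that id, so one bad row marks the whole subject
idHasNA <- as.logical(ave(as.numeric(rowHasNA), allData$id, FUN = max))

cleanedData <- allData[!idHasNA, ]
print(cleanedData)
```

Unlike `ddply`, this keeps the original row order and row names instead of sorting by id, but the surviving subjects (ids 2 and 3) are the same.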

Thanks, flodel! I think the real value of the ddply solution is that it's much more flexible than my home-made solution above. I can simply add further conditions and operations to the function, if I need it. — Jonas Lindeløv, Apr 17 '12 at 07:03

score 0 · Answer 2 · DrDom · 2012-03-29T05:42:02.083

The last two commands of your example can be collapsed into a single line, which looks simpler. Note that it only drops the rows that themselves contain NA, not every row for the affected id.

cleanedData <- allData[complete.cases(allData[,probeColumns]),]
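
A quick check on the six-row example from the question illustrates what this one-liner does and does not do. This is only a sketch: the toy data and the single column name `measure` are taken from the question's printout, and `rowFiltered`, `badIds`, and `idFiltered` are illustrative names.

```r
# Toy data copied from the printout in the question
allData <- data.frame(
  id      = c(1, 2, 3, 1, 2, 3),
  session = c(1, 1, 1, 2, 2, 2),
  measure = c(7.6, 4.5, 5.5, 7.1, NA, 4.9)
)
probeColumns <- "measure"

# Flag rows with NA in any probe column (drop = FALSE keeps a data frame)
rowHasNA <- !complete.cases(allData[, probeColumns, drop = FALSE])

# Row-wise filter: drops only the rows that themselves contain NA
rowFiltered <- allData[!rowHasNA, ]

# Id-wise filter: drops every row of any id that has an NA somewhere
badIds <- unique(allData$id[rowHasNA])
idFiltered <- allData[!allData$id %in% badIds, ]

nrow(rowFiltered)  # 5: id 2 still has its session-1 row
nrow(idFiltered)   # 4: id 2 is gone entirely
```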

Here is a correct version that uses only base R, just for fun. :) It is neither compact nor simple, though. flodel's answer is neater, and even your initial solution is more compact and, I think, faster.

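# lapply (not sapply) so the result is always a list; ids with NA yield NULL and are dropped by rbind
cleanedData <- do.call(rbind, lapply(unique(allData[,"id"]), function(x) {
  if(all(!is.na(allData[allData$id==x, probeColumns]))) allData[allData$id==x,]}))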

edited Mar 29 '12 at 05:42

answered Mar 28 '12 at 12:25

DrDom

Thanks. However, your proposal would only remove the rows with NA's in them (row 5 in the example above). I'm looking for a solution that additionally removes row 2, because it has the same level of "id" as row 5. – Jonas Lindeløv Mar 28 '12 at 13:14
@Jonas, I'm sorry, I didn't understand exactly what you wanted. I'll add another answer just for fun, which uses only the base package. But flodel's answer is more compact and nicer. – DrDom Mar 29 '12 at 05:25
