NA values in indexing data frame

Question

When subsetting data.frames by some condition, then if the data frame contains NAs, it might happen that you get NA value as a result of a condition. Then it will make problems in subsetting the data.frame:

# data generation
set.seed(123)
df <- data.frame(a = 1:100, b = sample(c("moon", "venus"), 100, replace = TRUE), c = sample(c('a', 'b', NA), 100, replace = TRUE))

# indexing
with(df, df[a < 30 & b == "moon" & c == "a",])

You get:

      a    b    c
NA   NA <NA> <NA>
10   10 moon    a
12   12 moon    a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29   29 moon    a

This happens because the condition results in vector containing NAs and then these NAs will produce the above result in indexing the data frame.

One of the solution would be one of these fixes:

with(df, df[a < 30 & b == "moon" & (c == "a" & !is.na(c)),])  # exclude NAs
with(df, df[a < 30 & b == "moon" & (c == "a" | is.na(c)),])  # include NAs

but these are pretty clumsy - imagine that you have a long condition like df[A == x1 & B == x2 & C == x3 & D == x4,] and you have to wrap each element like this - df[(A == x1 | is.na(A)) & (B == x2 | is.na(B)) ...,].

Is there any elegant solution to this problem which doesn't require you to write these tons of code on the console if you just try to inspect a data frame?

This question could do without the added rant. This behavior is documented extensively in `?Extract`, "stupid" or not. — Joshua Ulrich, Oct 17 '13 at 15:28
@SimonO101: then post an answer there, or update Shane's answer. I'm simply trying to keep useful content from being spread across multiple questions. — Joshua Ulrich, Oct 17 '13 at 15:41

mrip · Accepted Answer · 2013-10-17T16:07:18.353

Well, if you want to omit the NA rows, one quick and hackish solution is to wrap it in which:

> with(df, df[a < 30 & b == "moon" & c == "a",])
      a    b    c
NA   NA <NA> <NA>
10   10 moon    a
12   12 moon    a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29   29 moon    a
> with(df, df[which(a < 30 & b == "moon" & c == "a"),])
    a    b c
10 10 moon a
12 12 moon a
29 29 moon a

On edit: another option in circumstances like this, which might be frowned upon by some, but which I personally find very useful, is to define a local variable within the brackets:

> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i | is.na(i)},])
    a    b    c
6   6 moon <NA>
10 10 moon    a
12 12 moon    a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon    a
> with(df, df[{i<-a < 30 & b == "moon" & c == "a"; i & !is.na(i)},])
    a    b c
10 10 moon a
12 12 moon a
29 29 moon a

This is more concise than either writing a special function or defining the indices on a separate line, and is applicable in many situations where there isn't an R function that does exactly what you want.

+1 I thought this was basically what I suggested, but its a better and more succinct use of `which`. — Simon O'Hanlon, Oct 17 '13 at 15:46
nice solution! It is almost perfect!! Absolute perfect solution from my point of view would be like some global parameter setting that would change the way `[` operator works on NAs... — Tomas, Oct 17 '13 at 16:39

score 1 · Answer 2 · answered Oct 17 '13 at 15:40

You could use the data.table package. This would simplify the code, since you don't have to include everything in a with(df, ...) and it treats NAs as FALSE.

require(data.table)
dt <- data.table(df)
dt[a < 30 & b == "moon" & c == "a",] # exclude NAs
dt[a < 30 & b == "moon" & (c == "a"|is.na(c)),] # include NAs

score 1 · Answer 3 · answered Oct 17 '13 at 15:41

clean <- function(x, include = FALSE){
    x[is.na(x)] <- include
    x
}

# Original output
with(df, df[a < 30 & b == "moon" & c == "a",])
# Clean it up and remove NAs
with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
# Clean it up but include NAs
with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])

which gives

> with(df, df[a < 30 & b == "moon" & c == "a",])
      a    b    c
NA   NA <NA> <NA>
10   10 moon    a
12   12 moon    a
NA.1 NA <NA> <NA>
NA.2 NA <NA> <NA>
29   29 moon    a
> 
> with(df, df[clean(a < 30 & b == "moon" & c == "a"),])
    a    b c
10 10 moon a
12 12 moon a
29 29 moon a
> with(df, df[clean(a < 30 & b == "moon" & c == "a", include = TRUE),])
    a    b    c
6   6 moon <NA>
10 10 moon    a
12 12 moon    a
15 15 moon <NA>
18 18 moon <NA>
29 29 moon    a

Using which could also work but it would only allow you to exclude values by default

NA values in indexing data frame

3 Answers3