
Background: Before running a stepwise model selection, I need to remove rows with missing values in any of my model terms. With quite a few terms in the model, there are quite a few vectors I need to check for NA values (and any row with an NA in any of those vectors should be dropped). However, the data frame also contains vectors with NA values that I do not want to use as terms / criteria for dropping rows.

Question: How do I drop rows from a data frame that contain NA values in any of a list of vectors? I'm currently using the clunky method of chaining a long series of !is.na() calls:

> my.df[!is.na(my.df$termA) & !is.na(my.df$termB) & !is.na(my.df$termD), ]

but I'm sure that there is a more elegant method.

Oreotrephes

3 Answers


Let `dat` be a data frame and `cols` a vector of column names or column numbers of interest. Then you can use

dat[!rowSums(is.na(dat[cols])), ]

to exclude all rows with at least one NA in those columns.
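For the question's data, a quick sketch (column names assumed from the question):

cols <- c("termA", "termB", "termD")       # columns to screen for NA
my.df[!rowSums(is.na(my.df[cols])), ]      # keep only rows with zero NAs in those columns

Here `is.na()` returns a logical matrix, `rowSums()` counts the NAs per row, and negating that count keeps only the rows where it is zero.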

Sven Hohenstein
    This is, handily, the best solution to the problem of eliminating `NA`s in particular columns. I still like the `with` solution since it allows you to do other conditionals nicely and then also works nicely with altering data *in situ* using `within`. – Tyler Dec 14 '13 at 01:39

Edit: I completely glossed over subset(), the built-in function made for subsetting things:

my.df <- subset(my.df, 
  !(is.na(termA) |
    is.na(termB) |
    is.na(termC) )
  )

I tend to use with() for things like this. Don't use attach(); you're bound to cut yourself.

my.df <- my.df[with(my.df, {
  !(is.na(termA) |
    is.na(termB) |
    is.na(termC) )
}), ]

But if you often do this, you might also want a helper function, is_any():

is_any <- function(x){
  # TRUE wherever x is not NA; combine the results across terms with &
  !is.na(x)
}
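A sketch of how I assume the helper would be combined, using the question's term columns:

my.df <- my.df[with(my.df, is_any(termA) & is_any(termB) & is_any(termC)), ]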

If you end up doing a lot of this sort of thing, SQL often offers a nicer way of interacting with subsets of data. dplyr may also prove useful.
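For instance, a minimal dplyr sketch (assuming the same term columns; filter() keeps only rows where every condition is TRUE):

library(dplyr)
my.df <- filter(my.df, !is.na(termA), !is.na(termB), !is.na(termC))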

Tyler

This is one way:

#  create some random data
df <- data.frame(y=rnorm(100),x1=rnorm(100), x2=rnorm(100),x3=rnorm(100))
# introduce random NA's
df[round(runif(10,1,100)),]$x1 <- NA
df[round(runif(10,1,100)),]$x2 <- NA
df[round(runif(10,1,100)),]$x3 <- NA

# this does the actual work...
# assumes data is in columns 2:4, but can be anywhere
for (i in 2:4) {df <- df[!is.na(df[,i]),]}

And here's another, using sapply(...) and Reduce(...):

xx <- data.frame(!sapply(df[2:4],is.na))
yy <- Reduce("&",xx)
zz <- df[yy,]

The first statement "applies" the function is.na(...) to columns 2:4 of df and inverts the result (we want !NA). The second statement applies the logical & operator to the columns of xx in succession. The third statement extracts only the rows where yy is TRUE. Clearly this can all be combined into one horrifically complicated statement:

zz <-df[Reduce("&",data.frame(!sapply(df[2:4],is.na))),]

Using sapply(...) and Reduce(...) can be faster if you have very many columns.

Finally, most modeling functions have parameters that can be set to deal with NAs directly (without resorting to all this). See, for example, the na.action parameter in lm(...).
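For example, a sketch with the simulated df above: na.omit drops incomplete cases before fitting, while na.exclude does the same but pads residuals and fitted values back to the original number of rows.

fit <- lm(y ~ x1 + x2 + x3, data = df, na.action = na.omit)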

jlhoward
    These are clearly better than my solution when dealing with NAs. For reasonably sized data frames the for loop also has the advantage of being easily understood. I like the solution using `with` for the added advantage that it translates well to multiple disparate criteria (color == 'green', species %in% c('setosa', 'versicolor') etc.) – Tyler Dec 04 '13 at 03:25