125

Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).

library(dplyr)
df = data.frame(
    x1 = c(1,2,3,NA),
    x2 = c(1,2,NA,5)
)

df %.%
  filter(complete.cases(x1,x2))
zx8754
  • 52,746
  • 12
  • 114
  • 209
user2503795
  • 4,035
  • 2
  • 34
  • 49
  • 4
    `complete.cases` doesn't just accept vectors. It takes whole data frames, as well. – joran Mar 12 '14 at 13:56
  • But that doesn't work as part of `dplyr`'s filter function. I guess I wasn't clear enough and updated my question. – user2503795 Mar 12 '14 at 14:40
  • 1
    It would help if you could demonstrate exactly how it doesn't work with dplyr, but when I try it with filter, it works just fine. – joran Mar 12 '14 at 14:52

7 Answers7

230

Try this:

df %>% na.omit

or this:

df %>% filter(complete.cases(.))

or this:

library(tidyr)
df %>% drop_na

If you want to filter based on one variable's missingness, use a conditional:

df %>% filter(!is.na(x1))

or

df %>% drop_na(x1)

Other answers indicate that of the solutions above na.omit is much slower but that has to be balanced against the fact that it returns row indices of the omitted rows in the na.action attribute whereas the other solutions above do not.

str(df %>% na.omit)
## 'data.frame':   2 obs. of  2 variables:
##  $ x1: num  1 2
##  $ x2: num  1 2
##  - attr(*, "na.action")= 'omit' Named int  3 4
##    ..- attr(*, "names")= chr  "3" "4"

ADDED Have updated to reflect latest version of dplyr and comments.

ADDED Have updated to reflect latest version of tidyr and comments.

G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
27

This works for me:

df %>%
  filter(complete.cases(df))    

Or a little more general:

library(dplyr) # 0.4
df %>% filter(complete.cases(.))

This would have the advantage that the data could have been modified in the chain before passing it to the filter.

Another benchmark with more columns:

set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
  na.omit = {df %>% na.omit},
  filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
  rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
  filter = {df %>% filter(complete.cases(.))},
  times = 20L,
  unit = "relative")

#Unit: relative
#             expr       min        lq    median         uq       max neval
 #         na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233    20
 #filter.anonymous  1.149305  1.022891  1.013779  0.9948659  4.668691    20
 #         rowSums  2.281002  2.377807  2.420615  2.3467519  5.223077    20
 #          filter  1.000000  1.000000  1.000000  1.0000000  1.000000    20
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Miha Trošt
  • 2,002
  • 22
  • 25
  • 1
    I updated your answer with "." in the complete.cases and added benchmark - hope you don't mind :-) – talat Jan 26 '15 at 11:48
  • 1
    I found `df %>% slice(which(complete.cases(.)))` performed ~ 20% faster than the filter-approach in the benchmark above. – talat Jan 26 '15 at 13:49
  • 1
    It's worth noting that if you are using this filter in a dplyr pipe with other dplyr commands (such as group_by()), you will need to add `%>% data.frame() %>%` before you try and filter on complete.cases(.) because it won't work on tibbles or grouped tibbles or something. Or at least, that has been the experience I have had. – C. Denney Mar 01 '19 at 17:13
  • I found that filter(complete.cases(.)) works with 'tibbles' but does not work with grouped 'tibbles'. You can use ungroup() if you want to keep your data in a 'tibble' rather than convert to a proper data frame. – Knackiedoo Jun 10 '23 at 08:26
18

Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this maybe as part of filter.

library('rbenchmark')
library('dplyr')

n = 5e6
n.na = 100000
df = data.frame(
    x1 = sample(1:10, n, replace=TRUE),
    x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA


benchmark(
    df %>% filter(complete.cases(x1,x2)),
    df %>% na.omit(),
    df %>% (function(x) filter(x, complete.cases(x)))()
    , replications=50)

#                                                  test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))()           50   5.422    1.000
# 1               df %.% filter(complete.cases(x1, x2))           50   6.262    1.155
# 2                                    df %.% na.omit()           50 109.618   20.217
user2503795
  • 4,035
  • 2
  • 34
  • 49
13

This is a short function which lets you specify columns (basically everything which dplyr::select can understand) which should not have any NA values (modeled after pandas df.dropna()):

drop_na <- function(data, ...){
    if (missing(...)){
        f = complete.cases(data)
    } else {
        f <- complete.cases(select_(data, .dots = lazyeval::lazy_dots(...)))
    }
    filter(data, f)
}

[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]

Examples:

library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs
Jan Katins
  • 2,219
  • 1
  • 25
  • 35
  • Wouldn't be even more useful to be able to add a cutoff like 0.5 and have it process by columns? Case: eliminate variables with 50% and over missing data. Example : data[, -which(colMeans(is.na(data)) > 0.5)] It would be nice to be able to do this with tidyr. – Monduiz Feb 09 '17 at 17:02
  • @Monduiz This would mean that the addition of more data (where a variable then has lots of NA) could fail the next step in the pipeline because a needed variable is now missing... – Jan Katins Feb 09 '17 at 17:21
  • Right, that makes sense. – Monduiz Feb 09 '17 at 18:13
7

try this

df[complete.cases(df),] #output to console

OR even this

df.complete <- df[complete.cases(df),] #assign to a new data.frame

The above commands take care of checking for completeness for all the columns (variable) in your data.frame.

infominer
  • 1,981
  • 13
  • 17
  • Thanks. I guess I wasn't clear enough though (question updated). I know about complete.cases(df) but I would like to do it with `dplyr` as part of the filter function. That would allow a neat integration in dplyr chains etc. – user2503795 Mar 12 '14 at 14:38
  • In `dplyr:::do.data.frame` the statement `env$. <- .data` adds dot to the environment. No such statement in magrittr::"%>%"` – G. Grothendieck May 13 '14 at 11:47
  • Sorry must have entered the comment in wrong place. – G. Grothendieck May 13 '14 at 12:09
3

Just for the sake of completeness, dplyr::filter can be avoided altogether but still be able to compose chains just by using magrittr:extract (an alias of [):

library(magrittr)
df = data.frame(
  x1 = c(1,2,3,NA),
  x2 = c(1,2,NA,5))

df %>%
  extract(complete.cases(.), )

The additional bonus is speed, this is the fastest method among the filter and na.omit variants (tested using @Miha Trošt microbenchmarks).

mbask
  • 2,471
  • 18
  • 17
  • When I do the benchmark with the data by Miha Trošt, I find that using `extract()` is almost ten times slower than `filter()`. However, when I create a smaller data frame with `df <- df[1:100, 1:10]`, the picture changes and `extract()` is the fastest. – Stibu Jan 20 '17 at 08:39
  • You are correct. It looks like `magrittr::extract` is the fastest way only when `n <= 5e3` in Miha Trošt benchmark. – mbask Jan 24 '17 at 08:21
1

dplyr >= 1.0.4

if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This could be useful if you had other variables in your dataframe that were not part of what you considered complete case. For example, if you only wanted non-missing rows in columns that start with "x":

library(dplyr)
df = data.frame(
  x1 = c(1,2,3,NA),
  x2 = c(1,2,NA,5),
  y = c(NA, "A", "B", "C")
)

df %>% 
  dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))

  x1 x2    y
1  1  1 <NA>
2  2  2    A

For more information on these functions see this link.

LMc
  • 12,577
  • 3
  • 31
  • 43