Filter for complete cases in data.frame using dplyr (case-wise deletion)

Question

Is it possible to filter a data.frame for complete cases using dplyr? complete.cases with a list of all variables works, of course. But that is a) verbose when there are a lot of variables and b) impossible when the variable names are not known (e.g. in a function that processes any data.frame).

library(dplyr)
df = data.frame(
    x1 = c(1,2,3,NA),
    x2 = c(1,2,NA,5)
)

df %.%
  filter(complete.cases(x1,x2))

`complete.cases` doesn't just accept vectors. It takes whole data frames, as well. — joran, Mar 12 '14 at 13:56
But that doesn't work as part of `dplyr`'s filter function. I guess I wasn't clear enough and updated my question. — user2503795, Mar 12 '14 at 14:40
It would help if you could demonstrate exactly how it doesn't work with dplyr, but when I try it with filter, it works just fine. — joran, Mar 12 '14 at 14:52

G. Grothendieck · Accepted Answer · 2018-10-17T16:11:25.247

230

Try this:

df %>% na.omit

or this:

df %>% filter(complete.cases(.))

or this:

library(tidyr)
df %>% drop_na

If you want to filter based on one variable's missingness, use a conditional:

df %>% filter(!is.na(x1))

or

df %>% drop_na(x1)

Other answers indicate that of the solutions above na.omit is much slower but that has to be balanced against the fact that it returns row indices of the omitted rows in the na.action attribute whereas the other solutions above do not.

str(df %>% na.omit)
## 'data.frame':   2 obs. of  2 variables:
##  $ x1: num  1 2
##  $ x2: num  1 2
##  - attr(*, "na.action")= 'omit' Named int  3 4
##    ..- attr(*, "names")= chr  "3" "4"

ADDED Have updated to reflect latest version of dplyr and comments.

ADDED Have updated to reflect latest version of tidyr and comments.

edited Oct 17 '18 at 16:11

answered Mar 12 '14 at 16:51

G. Grothendieck

254,981
17
203
341

Just came back to answer and saw your useful answer! – infominer Mar 12 '14 at 20:09
1

Thanks! I added some benchmark results. `na.omit()` performs pretty poorly but the the one is fast. – user2503795 Mar 13 '14 at 08:38
1

This works now as well: `df %>% filter(complete.cases(.))`. Not sure whether recent changes in dplyr made this possible. – user2503795 Mar 24 '15 at 15:13
As @jan-katins points out, the Tidyverse function is called `drop_na`, so you can now do: `df %>% drop_na()`. – cbrnr May 16 '18 at 14:39

score 27 · Answer 2 · edited May 04 '16 at 14:50

27

This works for me:

df %>%
  filter(complete.cases(df))

Or a little more general:

library(dplyr) # 0.4
df %>% filter(complete.cases(.))

This would have the advantage that the data could have been modified in the chain before passing it to the filter.

Another benchmark with more columns:

set.seed(123)
x <- sample(1e5,1e5*26, replace = TRUE)
x[sample(seq_along(x), 1e3)] <- NA
df <- as.data.frame(matrix(x, ncol = 26))
library(microbenchmark)
microbenchmark(
  na.omit = {df %>% na.omit},
  filter.anonymous = {df %>% (function(x) filter(x, complete.cases(x)))},
  rowSums = {df %>% filter(rowSums(is.na(.)) == 0L)},
  filter = {df %>% filter(complete.cases(.))},
  times = 20L,
  unit = "relative")

#Unit: relative
#             expr       min        lq    median         uq       max neval
 #         na.omit 12.252048 11.248707 11.327005 11.0623422 12.823233    20
 #filter.anonymous  1.149305  1.022891  1.013779  0.9948659  4.668691    20
 #         rowSums  2.281002  2.377807  2.420615  2.3467519  5.223077    20
 #          filter  1.000000  1.000000  1.000000  1.0000000  1.000000    20

edited May 04 '16 at 14:50

David Arenburg

91,361
17
137
196

answered May 13 '14 at 11:04

Miha Trošt

2,002
22
25

1

I updated your answer with "." in the complete.cases and added benchmark - hope you don't mind :-) – talat Jan 26 '15 at 11:48
1

I found `df %>% slice(which(complete.cases(.)))` performed ~ 20% faster than the filter-approach in the benchmark above. – talat Jan 26 '15 at 13:49
1

It's worth noting that if you are using this filter in a dplyr pipe with other dplyr commands (such as group_by()), you will need to add `%>% data.frame() %>%` before you try and filter on complete.cases(.) because it won't work on tibbles or grouped tibbles or something. Or at least, that has been the experience I have had. – C. Denney Mar 01 '19 at 17:13
I found that filter(complete.cases(.)) works with 'tibbles' but does not work with grouped 'tibbles'. You can use ungroup() if you want to keep your data in a 'tibble' rather than convert to a proper data frame. – Knackiedoo Jun 10 '23 at 08:26

user2503795 · Answer 3 · 2017-06-01T18:18:47.247

Here are some benchmark results for Grothendieck's reply. na.omit() takes 20x as much time as the other two solutions. I think it would be nice if dplyr had a function for this maybe as part of filter.

library('rbenchmark')
library('dplyr')

n = 5e6
n.na = 100000
df = data.frame(
    x1 = sample(1:10, n, replace=TRUE),
    x2 = sample(1:10, n, replace=TRUE)
)
df$x1[sample(1:n, n.na)] = NA
df$x2[sample(1:n, n.na)] = NA


benchmark(
    df %>% filter(complete.cases(x1,x2)),
    df %>% na.omit(),
    df %>% (function(x) filter(x, complete.cases(x)))()
    , replications=50)

#                                                  test replications elapsed relative
# 3 df %.% (function(x) filter(x, complete.cases(x)))()           50   5.422    1.000
# 1               df %.% filter(complete.cases(x1, x2))           50   6.262    1.155
# 2                                    df %.% na.omit()           50 109.618   20.217

Jan Katins · Answer 4 · 2017-01-09T16:10:37.387

13

This is a short function which lets you specify columns (basically everything which dplyr::select can understand) which should not have any NA values (modeled after pandas df.dropna()):

drop_na <- function(data, ...){
    if (missing(...)){
        f = complete.cases(data)
    } else {
        f <- complete.cases(select_(data, .dots = lazyeval::lazy_dots(...)))
    }
    filter(data, f)
}

[drop_na is now part of tidyr: the above can be replaced by library("tidyr")]

Examples:

library("dplyr")
df <- data.frame(a=c(1,2,3,4,NA), b=c(NA,1,2,3,4), ac=c(1,2,NA,3,4))
df %>% drop_na(a,b)
df %>% drop_na(starts_with("a"))
df %>% drop_na() # drops all rows with NAs

edited Jan 09 '17 at 16:10

answered May 04 '16 at 14:47

Jan Katins

2,219
1
25
35

Wouldn't be even more useful to be able to add a cutoff like 0.5 and have it process by columns? Case: eliminate variables with 50% and over missing data. Example : data[, -which(colMeans(is.na(data)) > 0.5)] It would be nice to be able to do this with tidyr. – Monduiz Feb 09 '17 at 17:02
@Monduiz This would mean that the addition of more data (where a variable then has lots of NA) could fail the next step in the pipeline because a needed variable is now missing... – Jan Katins Feb 09 '17 at 17:21
Right, that makes sense. – Monduiz Feb 09 '17 at 18:13

score 7 · Answer 5 · answered Mar 12 '14 at 13:59

7

try this

df[complete.cases(df),] #output to console

OR even this

df.complete <- df[complete.cases(df),] #assign to a new data.frame

The above commands take care of checking for completeness for all the columns (variable) in your data.frame.

answered Mar 12 '14 at 13:59

infominer

1,981
13
17

Thanks. I guess I wasn't clear enough though (question updated). I know about complete.cases(df) but I would like to do it with `dplyr` as part of the filter function. That would allow a neat integration in dplyr chains etc. – user2503795 Mar 12 '14 at 14:38
In `dplyr:::do.data.frame` the statement `env$. <- .data` adds dot to the environment. No such statement in magrittr::"%>%"` – G. Grothendieck May 13 '14 at 11:47
Sorry must have entered the comment in wrong place. – G. Grothendieck May 13 '14 at 12:09

score 3 · Answer 6 · answered Jun 23 '16 at 08:09

3

Just for the sake of completeness, dplyr::filter can be avoided altogether but still be able to compose chains just by using magrittr:extract (an alias of [):

library(magrittr)
df = data.frame(
  x1 = c(1,2,3,NA),
  x2 = c(1,2,NA,5))

df %>%
  extract(complete.cases(.), )

The additional bonus is speed, this is the fastest method among the filter and na.omit variants (tested using @Miha Trošt microbenchmarks).

answered Jun 23 '16 at 08:09

mbask

2,471
18
17

When I do the benchmark with the data by Miha Trošt, I find that using `extract()` is almost ten times slower than `filter()`. However, when I create a smaller data frame with `df <- df[1:100, 1:10]`, the picture changes and `extract()` is the fastest. – Stibu Jan 20 '17 at 08:39
You are correct. It looks like `magrittr::extract` is the fastest way only when `n <= 5e3` in Miha Trošt benchmark. – mbask Jan 24 '17 at 08:21

score 1 · Answer 7 · answered Feb 08 '21 at 21:48

dplyr >= 1.0.4

if_any and if_all are available in newer versions of dplyr to apply across-like syntax in the filter function. This could be useful if you had other variables in your dataframe that were not part of what you considered complete case. For example, if you only wanted non-missing rows in columns that start with "x":

library(dplyr)
df = data.frame(
  x1 = c(1,2,3,NA),
  x2 = c(1,2,NA,5),
  y = c(NA, "A", "B", "C")
)

df %>% 
  dplyr::filter(if_all(starts_with("x"), ~!is.na(.)))

  x1 x2    y
1  1  1 <NA>
2  2  2    A

For more information on these functions see this link.

Filter for complete cases in data.frame using dplyr (case-wise deletion)

7 Answers7

Linked

Related