20

When I use filter from the dplyr package to drop a level of a factor variable, filter also drops the NA values. Here's an example:

library(dplyr)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = TRUE))))
#    var1
# 1  <NA>
# 2     3
# 3     3
# 4     1
# 5     1
# 6  <NA>
# 7     2
# 8     2
# 9  <NA>
# 10    1

filter(dat, var1 != 1)
#   var1
# 1    3
# 2    3
# 3    2
# 4    2

This does not seem ideal -- I only wanted to drop rows where var1 == 1.

It looks like this is occurring because any comparison with NA returns NA, which filter then drops. So, for example, filter(dat, !(var1 %in% 1)) produces the correct results. But is there a way to tell filter not to drop the NA values?
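To make the difference between the two tests concrete, here is a minimal base-R sketch. The vector x is made up for illustration and is not the data from the example above; the point is only that != propagates NA while %in% always returns TRUE or FALSE.

```r
# Hypothetical vector, used only to show how NA behaves in each test
x <- c(1, 2, NA, 3)

# A comparison with NA yields NA, which filter() treats like FALSE
keep_cmp <- x != 1
keep_cmp
#> [1] FALSE  TRUE    NA  TRUE

# %in% never returns NA, so the missing value survives the negation
keep_in <- !(x %in% 1)
keep_in
#> [1] FALSE  TRUE  TRUE  TRUE

x[keep_in]
#> [1]  2 NA  3
```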

zx8754
Jake Fisher
  • 2
    @akrun For some reason I didn't get this notification :P. Well I thought that the OP already knows about this, as he mentioned `filter(dat, !(var1 %in% 1))` which is similar, but I think this would be the only way to do it with `dplyr::filter`. – LyzandeR Oct 02 '15 at 14:31
  • 1
    I don't think there is a way to explicitly tell `filter` not to drop `NA` values, but in general, logical NA queries can be handled intuitively using the base `%in%` operator and its negation, defined as `"%ni%" <- Negate("%in%")`. Thus, you could use `filter(dat, var1 %ni% 1)`, which will work. See http://stackoverflow.com/a/11303276/4269699 and http://stackoverflow.com/a/27015823/4269699 – wjchulme Oct 02 '15 at 16:08
  • 2
    Yes, I did know about both this approach and the approach that @LyzandeR used for an answer. It looks like filter doesn't have an explicit option for "keep NA", so these workarounds will be fine. Thanks for your help. – Jake Fisher Oct 02 '15 at 17:07
  • 1
    Argh this happened to me and I was going crazy trying to understand why I was losing so much data. Agreed this seems like it is not ideal... – Arthur Yip Jun 23 '17 at 01:51

3 Answers

24

You could use this:

filter(dat, var1 != 1 | is.na(var1))
#   var1
# 1 <NA>
# 2    3
# 3    3
# 4 <NA>
# 5    2
# 6    2
# 7 <NA>

With the extra is.na(var1) condition, filter no longer drops the NA rows.

Also, for completeness: dropping NAs is the intended behavior of filter, as the following test shows:

test_that("filter discards NA", {
  temp <- data.frame(
    i = 1:5,
    x = c(NA, 1L, 1L, 0L, 0L)
  )
  res <- filter(temp, x == 1)
  expect_equal(nrow(res), 2L)
})

The test above is taken from the tests for filter in the dplyr repository on GitHub.

LyzandeR
  • 7
    Venturing a bit into opinion-based territory, do you have an idea why this is the chosen approach? This behavior was unexpected to me (I got bitten by it today). – Heisenberg Dec 09 '16 at 22:08
  • 1
    @Heisenberg I assume that, in Hadley's view, most people do not want any NAs back when filtering. But that is a question for the developer/maintainer, i.e. Hadley. – LyzandeR Dec 10 '16 at 18:35
1

The previous answers are good, but when the filter condition involves a function of many fields, those workarounds can get unwieldy. Also, who wants to use mapply with the non-vectorized identical? Here is another, somewhat simpler solution using coalesce:

filter(dat, coalesce(var1 != 1, TRUE))
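As a rough illustration of why coalesce helps when the condition spans several fields, here is a small sketch. The data frame toy and its columns a and b are hypothetical, invented for this example; the idea is that coalesce(cond, TRUE) replaces every NA in the condition with TRUE, so rows whose condition cannot be evaluated are kept.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: each column has one missing value
toy <- data.frame(a = c(1, 2, NA, 4), b = c(1, NA, 3, 0))

# The condition is NA wherever either a or b is NA
cond <- with(toy, a + b > 3)
cond
#> [1] FALSE    NA    NA  TRUE

# coalesce() fills those NAs with TRUE, so filter() keeps the rows
coalesce(cond, TRUE)
#> [1] FALSE  TRUE  TRUE  TRUE

kept <- filter(toy, coalesce(a + b > 3, TRUE))
nrow(kept)
#> [1] 3
```

Passing FALSE instead of TRUE as the second argument would presumably reproduce filter's default of dropping those rows.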
Harlan Nelson
0

I often map identical over the column with mapply.

(Note: I believe that, because of changes to sampling in R 3.6.0, set.seed and sample now produce different test data than in the question.)

library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = factor(sample(c(1:3, NA), size = 10, replace = TRUE))))
#>    var1
#> 1     3
#> 2     1
#> 3  <NA>
#> 4     3
#> 5     1
#> 6     3
#> 7     2
#> 8     3
#> 9     2
#> 10    1

filter(dat, var1 != 1)
#>   var1
#> 1    3
#> 2    3
#> 3    3
#> 4    2
#> 5    3
#> 6    2

filter(dat, !mapply(identical, as.numeric(var1), 1))
#>   var1
#> 1    3
#> 2 <NA>
#> 3    3
#> 4    3
#> 5    2
#> 6    3
#> 7    2

It also works for numerics and strings (probably the more common use case):

library(dplyr, warn.conflicts = FALSE)
set.seed(919)
(dat <- data.frame(var1 = sample(c(1:3, NA), size = 10, replace = TRUE),
                   var2 = letters[sample(c(1:3, NA), size = 10, replace = TRUE)],
                   stringsAsFactors = FALSE))
#>    var1 var2
#> 1     3 <NA>
#> 2     1    a
#> 3    NA    a
#> 4     3    b
#> 5     1    b
#> 6     3 <NA>
#> 7     2    a
#> 8     3    c
#> 9     2 <NA>
#> 10    1    b

filter(dat, !mapply(identical, var1, 1L))
#>   var1 var2
#> 1    3 <NA>
#> 2   NA    a
#> 3    3    b
#> 4    3 <NA>
#> 5    2    a
#> 6    3    c
#> 7    2 <NA>

filter(dat, !mapply(identical, var2, 'a'))
#>   var1 var2
#> 1    3 <NA>
#> 2    3    b
#> 3    1    b
#> 4    3 <NA>
#> 5    3    c
#> 6    2 <NA>
#> 7    1    b
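
One caveat about the factor example above, which calls as.numeric(var1): as.numeric() on a factor returns the internal level codes, not the labels. That happens to coincide there because the levels are "1", "2" and "3". The sketch below uses a made-up factor f to show the difference; converting through as.character() first is the usual way to recover the labels as numbers.

```r
# Hypothetical factor whose labels differ from its level codes
f <- factor(c(10, 20, NA, 10))

# Level codes: positions in levels(f), not the original values
codes <- as.numeric(f)
codes
#> [1]  1  2 NA  1

# Labels converted back to numbers
labels_num <- as.numeric(as.character(f))
labels_num
#> [1] 10 20 NA 10

# Comparing against the label 10 only works on the converted values
!mapply(identical, labels_num, 10)
#> [1] FALSE  TRUE  TRUE FALSE
```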
CJ Yetman