
Is this the expected behavior of filter() in dplyr? It seems horrendous. Am I missing something, or do I have the wrong version?

mydf <- data.frame(x = 1:5, y = c(letters[1:3], rep(NA, 2)))
mydf
  x    y
1 1    a
2 2    b
3 3    c
4 4 <NA>
5 5 <NA>

filter(mydf, y != 'a')
  x y
1 2 b
2 3 c

packageVersion('dplyr')
[1] ‘0.7.2’
  • `filter` has worked that way for a long time. You may need `filter(mydf, y != 'a' | is.na(y))`. I just checked with R 3.1.3 and dplyr_0.4.3 and it gives the same output as yours. – akrun Nov 07 '17 at 17:07
  • OMG - I have no idea how many bugs I introduced in my code without realizing this behavior. – Gopala Nov 07 '17 at 17:20

1 Answer


It's right there in the documentation for ?dplyr (although it seems like this was only added to the documentation 9 months ago):

Use filter() to find rows/cases where conditions are true. Unlike base subsetting, rows where the condition evaluates to NA are dropped.

This is consistent with the way base::subset() works, but not with how subsetting via `[` with logical indexing works.
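To make the contrast concrete, here is a small sketch (using the question's mydf) comparing the two base-R behaviors the answer refers to:

```r
mydf <- data.frame(x = 1:5, y = c(letters[1:3], rep(NA, 2)))

## base::subset() silently drops rows where the condition is NA,
## matching dplyr::filter(): only rows 2 and 3 come back
subset(mydf, y != "a")

## `[` with a logical index propagates the NA comparisons instead,
## returning rows 2 and 3 plus two all-NA rows for the NA positions
mydf[mydf$y != "a", ]
```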

As @akrun says in the comments, you can use filter(mydf, y != 'a' | is.na(y)) to preserve NA values. It would be nice to be able to use identical() or isTRUE(), but these aren't vectorized. You could write a convenience wrapper:

## NA-preserving equality: TRUE when x equals c *or* x is NA
eq <- function(x, c) { x == c | is.na(x) }
filter(mydf, eq(y, "a"))
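For the question's `!=` test specifically, the same idea can be sketched as an NA-preserving "not equal" wrapper (the `neq` name is my own illustration, not part of dplyr):

```r
library(dplyr)

mydf <- data.frame(x = 1:5, y = c(letters[1:3], rep(NA, 2)))

## NA-preserving inequality: TRUE when x differs from c *or* x is NA
## (`neq` is a hypothetical helper name, not a dplyr function)
neq <- function(x, c) { x != c | is.na(x) }

filter(mydf, neq(y, "a"))  # keeps rows 2:5, NA rows included
```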
Ben Bolker