I use dplyr quite a lot for data wrangling, but I have never understood how filter() behaves when it is called as filter(df, variable == c(value1, value2)).

Let's use the iris data set as an example.

library(dplyr)
data(iris)

# I want to filter by Species 'setosa' and 'versicolor'
# Solution 1
filter1 <- filter(iris, Species == 'setosa' | Species == 'versicolor')

nrow(filter1)
[1] 100 # expected result

# Solution 2
filter2 <- filter(iris, Species %in% c('setosa', 'versicolor'))

nrow(filter2)
[1] 100 # expected result

all(filter1 == filter2) # TRUE: both solutions return the exact same result

# Solution 3
filter3 <- filter(iris, Species == c('setosa', 'versicolor'))

nrow(filter3)
[1] 50 # unexpected result

unique(filter3$Species)
[1] setosa     versicolor
Levels: setosa versicolor virginica

Although Solution 3 keeps only the intended species, as shown by unique(filter3$Species), it returns just half of the matching rows (50, compared to 100 in Solution 1 and Solution 2). I would appreciate some guidance on what is actually going on in Solution 3.

FAmorim
  • It's recycling `c('setosa', 'versicolor')` to match the length of `Species`, so it's only matching 50% of the time in the 100 rows with those `Species`. Try `c("a", "b", "a", "b") == c("a", "b")` and `c("a", "b", "b", "a") == c("a", "b")` to see the difference. – caldwellst Feb 11 '22 at 10:50
  • You can see the details in this post: [What is the difference between \`%in%\` and \`==\`?](https://stackoverflow.com/questions/15358006/what-is-the-difference-between-in-and) – caldwellst Feb 11 '22 at 10:52
  • I see, thank you for your comment! Based on my specific example, when doing `iris$Species == c('setosa', 'versicolor')` this behaviour becomes quite clear! – FAmorim Feb 11 '22 at 11:10

1 Answer

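filter(iris, Species == c("versicolor", "setosa")) does not do what it intuitively suggests, because a single Species value is not a 2-tuple. == compares element by element, so one value compared against a vector of length two gives two results: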

> "setosa" == c("setosa", "versicolor")
  [1]  TRUE FALSE
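
To make the recycling explicit, here is a minimal base R sketch on toy vectors. The names x, y and pattern are made up for illustration and are not part of the question. It shows that == repeats the shorter vector to the length of the longer one and then compares position by position, while %in% tests each element for membership:

```r
# Toy vectors (hypothetical names, for illustration only)
x <- c("a", "b", "a", "b")
y <- c("a", "b", "b", "a")
pattern <- c("a", "b")

# == recycles `pattern` to length 4 before comparing, i.e. it compares against:
rep_len(pattern, length(y))
#> [1] "a" "b" "a" "b"

x == pattern   # x happens to line up with the recycled pattern
#> [1] TRUE TRUE TRUE TRUE

y == pattern   # positions 3 and 4 do not line up
#> [1]  TRUE  TRUE FALSE FALSE

# %in% asks "is this element one of the values?" for every element
y %in% pattern
#> [1] TRUE TRUE TRUE TRUE
```

Both x and y contain only "a" and "b", yet == gives different answers because only the position of each value matters.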

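Swapping the order, as in filter(iris, Species == c("setosa", "versicolor")), also returns 50 rows. The two-element vector is recycled along the rows of the data frame, so each row is compared with only one of the two values, alternating by row position. Which rows are kept therefore depends on the order of the vector and on the order of the rows. Only the first ten rows are printed below, so each output starts with whichever of the two species comes first in the data frame, and sorting in descending order puts versicolor at the top: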

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>%
  as_tibble()
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

iris %>%
  filter(Species == c('versicolor', 'setosa')) %>%
  as_tibble()
#> # A tibble: 50 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          4.9         3            1.4         0.2 setosa 
#>  2          4.6         3.1          1.5         0.2 setosa 
#>  3          5.4         3.9          1.7         0.4 setosa 
#>  4          5           3.4          1.5         0.2 setosa 
#>  5          4.9         3.1          1.5         0.1 setosa 
#>  6          4.8         3.4          1.6         0.2 setosa 
#>  7          4.3         3            1.1         0.1 setosa 
#>  8          5.7         4.4          1.5         0.4 setosa 
#>  9          5.1         3.5          1.4         0.3 setosa 
#> 10          5.1         3.8          1.5         0.3 setosa 
#> # … with 40 more rows

iris %>%
  arrange(Species) %>%
  filter(Species == c('versicolor', 'setosa')) %>%
  as_tibble()
#> # A tibble: 50 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          4.9         3            1.4         0.2 setosa 
#>  2          4.6         3.1          1.5         0.2 setosa 
#>  3          5.4         3.9          1.7         0.4 setosa 
#>  4          5           3.4          1.5         0.2 setosa 
#>  5          4.9         3.1          1.5         0.1 setosa 
#>  6          4.8         3.4          1.6         0.2 setosa 
#>  7          4.3         3            1.1         0.1 setosa 
#>  8          5.7         4.4          1.5         0.4 setosa 
#>  9          5.1         3.5          1.4         0.3 setosa 
#> 10          5.1         3.8          1.5         0.3 setosa 
#> # … with 40 more rows


iris %>%
  arrange(desc(Species)) %>%
  filter(Species == c('setosa', 'versicolor')) %>%
  as_tibble()
#> # A tibble: 50 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          6.4         3.2          4.5         1.5 versicolor
#>  2          5.5         2.3          4           1.3 versicolor
#>  3          5.7         2.8          4.5         1.3 versicolor
#>  4          4.9         2.4          3.3         1   versicolor
#>  5          5.2         2.7          3.9         1.4 versicolor
#>  6          5.9         3            4.2         1.5 versicolor
#>  7          6.1         2.9          4.7         1.4 versicolor
#>  8          6.7         3.1          4.4         1.4 versicolor
#>  9          5.8         2.7          4.1         1   versicolor
#> 10          5.6         2.5          3.9         1.1 versicolor
#> # … with 40 more rows

iris %>%
  arrange(desc(Species)) %>%
  filter(Species == c('versicolor', 'setosa')) %>%
  as_tibble()
#> # A tibble: 50 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          7           3.2          4.7         1.4 versicolor
#>  2          6.9         3.1          4.9         1.5 versicolor
#>  3          6.5         2.8          4.6         1.5 versicolor
#>  4          6.3         3.3          4.7         1.6 versicolor
#>  5          6.6         2.9          4.6         1.3 versicolor
#>  6          5           2            3.5         1   versicolor
#>  7          6           2.2          4           1   versicolor
#>  8          5.6         2.9          3.6         1.3 versicolor
#>  9          5.6         3            4.5         1.5 versicolor
#> 10          6.2         2.2          4.5         1.5 versicolor
#> # … with 40 more rows

Created on 2022-02-11 by the reprex package (v2.0.0)
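
As a rough check of the explanation above, the following base R sketch counts which rows survive the == comparison in the unsorted iris data. The names target and keep are hypothetical helpers introduced here, and the counts assume the standard iris ordering (50 setosa rows, then 50 versicolor, then 50 virginica):

```r
data(iris)

# Hypothetical helper names, for illustration only
target <- c("setosa", "versicolor")

# Same comparison that filter() evaluates: `target` is recycled over 150 rows
keep <- iris$Species == target

# It is equivalent to comparing against the explicitly recycled vector
all(keep == (as.character(iris$Species) == rep_len(target, nrow(iris))))
#> [1] TRUE

# Only every second row of each species lines up with the recycled value
table(iris$Species[keep])
#> 
#>     setosa versicolor  virginica 
#>         25         25          0 

# Set membership keeps every row of both species
sum(iris$Species %in% target)
#> [1] 100
```

This is why Solution 3 in the question returns 50 rows containing both species, while the %in% version returns all 100.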

danlooo