
When I try to filter a data frame using the %in% operator with a pull() subquery on the right-hand side, it does not work. However, when I store the result of the pull() subquery in a variable and then use the %in% operator on that variable, it does work.

As an example, I use the well-known mtcars dataset.

library(tidyverse)
mydf <- tibble(mtcars)

Say I want all the observations that share the same combination of cyl, am and vs. The following code does not work:

mydf |> filter(mpg %in% 
mydf |> filter(duplicated(paste0(cyl,am,vs))) |> pull(mpg)
)

Error:

Error in `filter()`:
ℹ In argument: `pull(...)`.
Caused by error in `UseMethod()`:
! no applicable method for 'filter' applied to an object of class "logical"

However, the same structure works when I use a variable:

mpg_as_var <- mydf |> filter(duplicated(paste0(cyl,am,vs))) |> pull(mpg)
mydf |> filter(mpg %in% mpg_as_var)

I don't want just the duplicates, but also the first occurrence of each duplicated combination. Otherwise it would have been a simple filter(duplicated()) query.

Got any ideas?

Yann
  • What exactly are you looking for? When you say *share cyl+am+vs*, do you mean `cyl == am == vs`? – Onyambu Jul 19 '23 at 13:18
  • Your requested query sounds like it will also output rows which share ‘mpg’ with a row with duplicated values. That doesn’t arise in this particular dataset but could with others. – Jon Spring Jul 19 '23 at 13:23
  • You have an order of operations problem. The `%in%` and `|>` operators have the same precedence (per the `?Syntax` help page), so they are evaluated left to right. But you seem to want the `|>` to happen "first". Use parentheses to control the order of operations. If you look at the output of `quote(mydf |> filter(mpg %in% mydf |> filter(duplicated(paste0(cyl,am,vs))) |> pull(mpg)))` you can see how it's interpreted by the parser. – MrFlick Jul 19 '23 at 13:23
  • https://stackoverflow.com/questions/28244123/find-duplicated-elements-with-dplyr Offers other approaches which will address the edge case I described above. – Jon Spring Jul 19 '23 at 13:26
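
The edge case raised in the comments (rows that merely share an `mpg` value with a duplicated row) can be avoided by not going through `mpg` at all. The following is a minimal sketch, assuming the goal is to keep every row whose (cyl, am, vs) combination occurs more than once, first occurrence included. The name `shared` is hypothetical, and this is one possible approach rather than the one used in the question.

```r
library(dplyr)

mydf <- as_tibble(mtcars)

# Group by the three key columns and keep only groups with more than one row.
# Every member of such a group is kept, including its first occurrence.
shared <- mydf |>
  group_by(cyl, am, vs) |>
  filter(n() > 1) |>
  ungroup()

shared
```

Grouping on the columns directly also sidesteps building a string key with `paste0()`, which could in principle collide for multi-digit values.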

1 Answer


Use parentheses around the vector you create, e.g.

      mydf |> filter(mpg %in% 
                       (mydf |> filter(duplicated(paste0(cyl,am,vs))) |> pull(mpg))
      )

# A tibble: 28 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 6  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
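
To see why the parentheses matter, it may help to inspect how R parses the two forms. This is a small sketch that needs R 4.1 or later for the native pipe; `quote()` only parses and never evaluates, so no packages or data are required. The symbol `keep` is a hypothetical stand-in for the `duplicated(...)` condition.

```r
# Without parentheses, %in% and |> have equal precedence and group left to right.
unparenthesised <- quote(mpg %in% mydf |> filter(keep) |> pull(mpg))

# With parentheses, the whole pipeline becomes the right-hand side of %in%.
parenthesised <- quote(mpg %in% (mydf |> filter(keep) |> pull(mpg)))

# Outermost function of each parsed expression
unparenthesised[[1]]  # the symbol pull
parenthesised[[1]]    # the symbol %in%

# What filter() receives as its first argument in the unparenthesised form:
# the logical expression, not a data frame
unparenthesised[[2]][[2]]  # mpg %in% mydf
```

In the unparenthesised form, `filter()` is handed the result of `mpg %in% mydf`, which is a logical vector. That matches the "no applicable method for 'filter' applied to an object of class "logical"" error in the question.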
Julian
  • Lol, what a simple solution. Thanks. Kind of a shame I didn't think of it myself. Good ol' `c()`. I can't vote it up, but it is a great solution. Obviously the program didn't understand that I intended to evaluate the pull first. Makes a lot of sense. – Yann Jul 19 '23 at 13:49