Is it possible with dplyr to filter a dataframe with output created by summarize within one pipe?

Question

I have a data frame with one numeric variable and one five-level factor variable.

library(dplyr)  # provides tibble() and %>%
# set seed for reproducibility
set.seed(123)
df <- tibble(group = rep(c("a", "b", "c", "d", "e"), each = 20),
             values = c(rnorm(20, 0, 1), rnorm(20, 1, 1), rnorm(20, 2, 1),
                        rnorm(20, 3, 1), rnorm(20, 4, 1)))

I want to use summarize to get the quantiles, like this:

df %>% 
  group_by(group) %>%
  summarize(quantiles = quantile(values, c(0.25, 0.75))) 


df %>% 
  group_by(group) %>%
  summarize(quantile0.25 = quantile(values, c(0.25)), 
            quantile0.75 = quantile(values, c(0.75)))

Either of these would do. I don't know which is more practical: one row per group with the two quantiles as separate variables, or two rows per group with the quantiles in a single variable.

Finally, I want to use the quantiles (preferably in the same pipe) to filter outliers out of the original data frame, not the summarized one, within each respective group, like

df %>% 
  group_by() %>%
  summarize() %>%
  filter()

where each group is filtered by its respective quantiles ± 1.5 × IQR.

Is this possible, and what would be the best approach? I think it would be straightforward to filter by group with one filter value that gets applied to all groups, but how do I apply a different filter value for each group?

I think it'd be better if you do `group_by` + `mutate` instead of `summarize`. Then you will add the columns of lower and upper quantiles using your 2nd method, and refer to those in `filter`. — yarnabrina, Apr 08 '21 at 04:16
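
A minimal sketch of what this comment describes, assuming the `df` from the question. The helper column names `q25`, `q75`, and `iqr` are illustrative and do not come from the original post:

```r
library(dplyr)

set.seed(123)
df <- tibble(group = rep(c("a", "b", "c", "d", "e"), each = 20),
             values = c(rnorm(20, 0, 1), rnorm(20, 1, 1), rnorm(20, 2, 1),
                        rnorm(20, 3, 1), rnorm(20, 4, 1)))

filtered <- df %>%
  group_by(group) %>%
  # mutate() keeps every row, so each row carries its own group's quartiles
  mutate(q25 = quantile(values, 0.25),
         q75 = quantile(values, 0.75),
         iqr = q75 - q25) %>%
  # keep only rows inside the per-group fences
  filter(values >= q25 - 1.5 * iqr,
         values <= q75 + 1.5 * iqr) %>%
  ungroup() %>%
  # drop the helper columns again
  select(group, values)
```

Unlike `summarize`, `mutate` does not collapse the groups, which is why the original rows are still available to `filter` in the same pipe.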

score 4 · Accepted Answer · edited Apr 28 '21 at 16:32

You can write a function to detect outliers via the IQR:

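is_iqr_outlier <- function(x) {
  # first and third quartiles of x
  q <- quantile(x, c(0.25, 0.75))
  # interquartile range
  iqr <- diff(q)
  # TRUE where x lies more than 1.5 * IQR beyond either quartile
  (x < q[1] - 1.5 * iqr) | (x > q[2] + 1.5 * iqr)
}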

And then you can just use that in the filter:

df %>% 
  group_by(group) %>%
  filter(!is_iqr_outlier(values))

Because the data frame is grouped, the filter operates by group, so each value is compared against its own group's quartiles. Your sample data doesn't seem to have any outliers, so it's not a great test case.
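
To see the per-group behaviour, here is a small sketch on made-up data (the `toy` tibble is hypothetical, not from the question). The value 10 appears in both groups, but it is only extreme relative to group "a":

```r
library(dplyr)

is_iqr_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- diff(q)
  (x < q[1] - 1.5 * iqr) | (x > q[2] + 1.5 * iqr)
}

# Hypothetical toy data: 10 is unremarkable in group "b",
# but far outside the spread of group "a".
toy <- tibble(group = rep(c("a", "b"), each = 6),
              values = c(1, 2, 3, 4, 5, 10,
                         8, 9, 10, 11, 12, 13))

# Rows flagged as outliers within their own group
outliers <- toy %>%
  group_by(group) %>%
  filter(is_iqr_outlier(values)) %>%
  ungroup()

# Rows kept after removing per-group outliers
kept <- toy %>%
  group_by(group) %>%
  filter(!is_iqr_outlier(values)) %>%
  ungroup()
```

With R's default quantile method, group "a" has quartiles 2.25 and 4.75, so its upper fence is 8.5 and the 10 in that group is flagged, while the 10 in group "b" (fences 5.5 and 15.5) is kept.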
