0

I have a dataset that has 2 columns; column A is State_Name and has 5 different options of state, and column B is Total_Spend which has the average total spend of that state per day. There are 365 observations for each state.

What I want to do is count the number of outliers PER STATE using the 1.5 IQR rule and save the count of outliers per state to a new df or table.

So I would expect an output something like:

State Outlier Count
ATL 5
GA 20
MI 11
NY 50
TX 23

I have managed to get it to work by doing it one state at a time but I can't figure out what to do to achieve this in a single go.

Here is my code at the moment (to return the result for a single state):

  daily_agg %>% 
  select(State_Name, Total_Spend) %>%
  filter(State_Name == "NY")

outlier_NY <- length(boxplot.stats(outlier_df$Total_Spend)$out)

Any help would be appreciated.

Thanks!

EDIT WITH TEST DATASET


outlier_mtcars <- 
  df %>%  
  select(cyl, disp) %>%
  filter(cyl == "6")
  
outliers <- length(boxplot.stats(outlier_mtcars$disp)$out)

The above shows me 1 outlier for 6 cyl cars but I want a table that shows how many outliers for 4, 6, 8 cyl cars

K-J
  • 99
  • 9
  • Have you tried grouping by state? `%>% group_by(State) %>%` then you can do the same stuff for each of the groups, probably with `map` function. – wernor Aug 22 '22 at 12:23
  • I tried the below and didn't work; it creates the new column but it is all 0's. I haven't played much with the map function. Any ideas on how to achieve what I'm looking for using map? ```daily_agg %>% select(State_Name, Total_Spend) %>% group_by(State_Name) %>% mutate("Outlier Count" = length(boxplot.stats(daily_agg$Total_Spend)$out)) ``` – K-J Aug 22 '22 at 12:32
  • With no data I can't really try but `mutate` keeps a data frame with the same number of rows. It seems you want just one row per "State_Name" so you would need `summarise` /`summarize` instead of `mutate`. – wernor Aug 22 '22 at 12:40
  • Or, if it works with `mutate` you could then use `slice` to keep just one row for each "State_Name" since the new column will have the same value for each row within the group. – wernor Aug 22 '22 at 12:42
  • Nope mutate doesn't work and gives me all 0's. I didn't recall mutate leaves the original rows in so thank you for that tip. I tried with summarise but still couldn't get it to work. I have edited my original post with a dataset that does have an outlier in one of the variable factors in case you can help me based on that? – K-J Aug 22 '22 at 12:50

2 Answers2

2

Since I'm not very familiar with the function boxplot.stats, I didn't use this in my solution and instead manually calculates 1.5 * IQR + upper quantile.

Here mtcars was used as an example. For the records that are outliers, they are "flagged" as TRUE, where we can sum them up in summarize.

library(dplyr)

mtcars %>% 
  group_by(cyl) %>% 
  mutate(flag = disp >= (IQR(disp) * 1.5 + quantile(disp, probs = 0.75)), .keep = "used") %>% 
  summarize(Outlier = sum(flag))

# A tibble: 3 × 2
    cyl Outlier
  <dbl>   <int>
1     4       0
2     6       1
3     8       0
benson23
  • 16,369
  • 9
  • 19
  • 38
  • Thank you for your comment. I used your exact code on the mtcars dataset but get an output of 0: ```> mtcars %>% group_by(cyl) %>% + mutate(flag = disp >= (IQR(disp) * 1.5 + quantile(disp, probs = 0.75)), .keep = "used") %>% + summarise(Outlier = sum(flag)) Outlier 1 0 ``` – K-J Aug 22 '22 at 13:22
  • If you can confirm your `mtcars` is unmodified, let's try doing it step-by-step. If you only run the first two lines (`mtcars %>% group_by(cyl) %>% mutate(flag = disp >= (IQR(disp) * 1.5 + quantile(disp, probs = 0.75)), .keep = "used")`), does the output only has three columns (`cyl`, `disp` and `flag`)? If so, is there only one `TRUE` record across the `flag` column? Also, it would be helpful if you check your `dplyr` version using `packageVersion("dplyr")`. Mine is version 1.0.8 :) – benson23 Aug 22 '22 at 13:34
  • 1
    I restarted my r session after reading that installing plyr after installing dplyr creates problems when using dplyr functions and now this works perfectly! and such a simple solution - thank you so much – K-J Aug 22 '22 at 13:59
1

Since I don't have your data, I'll make some up with the two columns you mention:

df<-data.frame(state=sample(c("ny","fl"),100, replace=TRUE),
           spend=sample(1:100, 100, replace=TRUE))

> head(df)
  state spend
1    ny     3
2    fl    87
3    ny    91
4    fl    97
5    ny    47
6    fl     8

Then set your upper and lower bounds (could be quartiles, absolutes, whatever..)

df%>%
  group_by(state)%>%
  mutate(lower_bound=quantile(spend,0.25),
         upper_bound=quantile(spend,0.75))%>%
  mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))

# A tibble: 10 × 5
# Groups:   state [2]
   state spend lower_bound upper_bound is_outlier
   <chr> <int>       <dbl>       <dbl> <lgl>     
 1 ny        3          38          84 TRUE      
 2 fl       87          26          87 FALSE     
 3 ny       91          38          84 TRUE      
 4 fl       97          26          87 TRUE 

Then if you only want to see the output, summarise by is_outlier:

df%>%
  group_by(state)%>%
  mutate(lower_bound=quantile(spend,0.25),upper_bound=quantile(spend,0.75))%>%
  mutate(is_outlier=if_else(spend<lower_bound|spend>upper_bound,TRUE,FALSE))%>%
  summarise(outliers=sum(is_outlier))

  state         outliers
  <chr>             <int>
1 fl                   19
2 ny                   30
  • Thanks Justin. This was also a really eloquent solution. I have already accepted someone's answer but have upvoted yours. FYI for anyone else reading this, this was the code used in the mutate line to flag for outliers using the 1.5 IQR rule: ```mutate(lower_bound=(quantile(DAILY_DEMAND,0.25))-(1.5 * IQR(DAILY_DEMAND)), upper_bound=(quantile(DAILY_DEMAND,0.75))+(1.5 * IQR(DAILY_DEMAND))) ``` – K-J Aug 22 '22 at 14:07