r remove outliers from a list of data.frames and make a new list of data.frames?

Question

I have a List of 6 in a data.frame

It has 3 columns:

id, T_C, Sales

T_C is TEST or CONTROL

Someone helped me here and I learned how to find the mean() and sd() by looping, instead of doing individual statements.

Now my goal is to remove the outliers from the 6 lists and produce a List of 6 (after removing outliers).

str(dfList) # this is the list of 6 in data.frames

I am able to get the mean() and sd() of each list like this:

list_mean_sd <- lapply(dfList,
                       function(df) 
                        {
                         df %>%
                           group_by(TC_INDICATOR) %>%
                           summarise(mean = mean(NET_SPEND),
                                     sd = sd(NET_SPEND))
                        })

> str(list_mean_sd)
List of 6  (1 obs. of  2 variables:)

I can selected them individually for mean or sd:

sapply(list_mean_sd, "[", "mean")
sapply(list_mean_sd, "[", "sd")

Basically, my goal is to id the outliers and remove them, product an alternative set, or after-set.

**outliers are:  mean - 3*sd()  or  mean + 3*sd()

I have this done, but with more manually steps, looking to learn how to loop through these sets and stuff like that, thanks in advance for helping me!

score 0 · Accepted Answer · edited Jun 20 '20 at 09:12

Give this a shot. First I create data which I split into six data frames which are housed in a list.

set.seed(0)
test_data <- data.frame(id = 1:10000, 
                        T_C = sample(c(TRUE, FALSE), size = 10000, replace = TRUE),
                        Sales = rnorm(n = 10000),
                        grp = sample(c("a", "b", "c", "d", "e", "f"), 
                                     size = 10000, replace = TRUE))

test_split <- split(test_data, test_data$grp)

Then, I use lapply on this list to identify what I'm calling the z_scores which are computed as the difference between the mean of Sales and each individual Sales divided by the sd of Sales. Finally, we use filter on these to pull out the ones which have a z_score with an absolute value over 3.

library(dplyr)
outlier_list <- lapply(test_split, 
       function(m) group_by(m, T_C) %>% mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
         ungroup() %>% filter(abs(z_score) >= 3)
)

> outlier_list
$a
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   468  TRUE -2.995332      a -3.073314
2  3026  TRUE  3.028495      a  3.075258
3  5188  TRUE -3.097847      a -3.177952
4  7993 FALSE -3.571076      a -3.823983
5  9105  TRUE -3.216710      a -3.299276

$b
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   264  TRUE  3.003494      b  3.003329
2  2172  TRUE  3.001475      b  3.001326
3  2980 FALSE -3.176356      b -3.222782
4  3366 FALSE  3.009292      b  3.048559
5  7477 FALSE  3.348301      b  3.392265
6  7583  TRUE -3.089758      b -3.040911

$c
# A tibble: 2 × 5
     id   T_C    Sales    grp  z_score
  <int> <lgl>    <dbl> <fctr>    <dbl>
1  8078  TRUE 3.015343      c 3.129923
2  8991 FALSE 3.113526      c 3.058302

$d
# A tibble: 5 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   544  TRUE  3.289070      d  3.168235
2  3791 FALSE  3.791938      d  3.769810
3  6771 FALSE -3.157741      d -3.166861
4  7864  TRUE  3.164128      d  3.045728
5  9371  TRUE -3.026884      d -3.024655

$e
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1   186 FALSE  3.021541      e  3.046079
2  1211  TRUE  3.414337      e  3.343521
3  1665  TRUE  3.546282      e  3.473614
4  3765 FALSE  3.363641      e  3.391142
5  4172  TRUE  3.348820      e  3.278923
6  7973 FALSE -2.987790      e -3.015284

$f
# A tibble: 6 × 5
     id   T_C     Sales    grp   z_score
  <int> <lgl>     <dbl> <fctr>     <dbl>
1  1089  TRUE -3.195090      f -3.189979
2  2452 FALSE  3.287591      f  3.212317
3  3486 FALSE -3.334942      f -3.367962
4  4198 FALSE -3.102578      f -3.137082
5  8183  TRUE  3.081077      f  3.075324
6  8656  TRUE  3.253873      f  3.247822

Obviously, this will give you only the outliers. If you want to keep only the inliers, you change the >= 3 to a < 3.

Updated to get Wilcox test on inliers

inlier_list <- lapply(test_split, 
                       function(m) group_by(m, T_C) %>% 
                        mutate(z_score = (Sales - mean(Sales)) / sd(Sales)) %>%
                         ungroup() %>% filter(abs(z_score) < 3)
)

We just run lapply on the list of inliers using the parameters noted in OP's comment.

wilcox_test_res <- lapply(inlier_list, 
                          function(m) wilcox.test(m$Sales ~ m$T_C, 
                                                  mu= mean(m$Sales[m$T_C == TRUE]), 
                                                  conf.level=0.95,

Thanks! I won't be at work until Tuesday to try it out. What is the Group for ? I took a class on R, but I am finding that I really have to use it to get any good, the class covers too much in too little time. — Ray Kodiak, Dec 23 '16 at 01:30
In this context, `group_by` allows me to get the `mean` and `sd` aggregated by the `T_C` column within each `data.frame`. Since I use `mutate` after the `group_by`, the column I add with `mutate` is computed based on that grouping. — Nick Criswell, Dec 23 '16 at 16:09
That is just a dummy variable I made so I could split the simulated data into six sets. I used the 'grp' column to do that. — Nick Criswell, Dec 24 '16 at 18:07
This worked great, how can I employ the loop inside the Wilcox.test, so that it looks over all these 6 lists in one go ? — Ray Kodiak, Dec 27 '16 at 17:58
this is what I use now: wilcox.test(NET_SPEND ~ TC_INDICATOR, mu=means["CONTROL"], conf.level=0.95, paired=F, data=lessoutliers_201611) — Ray Kodiak, Dec 27 '16 at 17:58
I took a shot at this, but I'm not familiar with the `wilcox.text` function in `R` at all. I just got an inlier list and ran the function you built on that list with `lapply`. Not sure if that is what you are going for or not. Your best bet might be to ask a new question that specifically details how you are trying to use this test. — Nick Criswell, Dec 27 '16 at 18:10

r remove outliers from a list of data.frames and make a new list of data.frames?

1 Answers1

Updated to get Wilcox test on inliers