I want to use this IQR function:

    smooth_outliers <- function(x, na.rm = TRUE, ...) {
      qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
      H <- 1.3 * IQR(x, na.rm = na.rm)
      y <- x
      y[x < (qnt[1] - H)] <- round(qnt[1] - H)
      y[x > (qnt[2] + H)] <- round(qnt[2] + H)
      y
    }

on the df below, applied to the total column separately for each key in the key column:

    key     total
    US4ZNB  10
    US4ZNB  1075
    US4ZNB  10000
    US4ZNB  1138
    US4ZNB  1156
    US4YYM  1114
    US4YYM  1072
    US4YYM  50
    US4YYM  1181
    US4YYM  8000
    JM4YYM  15000
    JM4YYM  2000
    JM4YYM  100
    JM4YYM  2200
    JM4YYM  2300
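To make the expected per-key behaviour concrete, here is a rough worked example for a single key. It only inlines the steps of smooth_outliers on the five US4ZNB values from the table above; the names `lower` and `upper` are illustrative and not part of the original function.

```r
# `total` values for key US4ZNB, copied from the table above
x <- c(10, 1075, 10000, 1138, 1156)

qnt <- quantile(x, probs = c(.25, .75))  # quartiles: 1075 and 1156
H   <- 1.3 * IQR(x)                      # 1.3 * 81 = 105.3

lower <- round(qnt[[1]] - H)             # 970
upper <- round(qnt[[2]] + H)             # 1261

# Same replacement logic as smooth_outliers
y <- x
y[x < qnt[1] - H] <- lower
y[x > qnt[2] + H] <- upper
y
# 970 1075 1261 1138 1156
```

So within this key, 10 is pulled up to 970 and 10000 is pulled down to 1261, while the three middle values stay as they are.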
Matan Retzer
  • Perhaps I don't understand your question. If your data is in `df`, then `df$smooth <- smooth_outliers(df$total)` will properly use your function to smooth any outliers. However, by the criteria used in your function for identifying outliers, there are no outliers, so the function correctly returns the input data unchanged. – WaltS Jun 27 '18 at 12:12
  • You are right. I changed the df, so now there are outliers, but my issue is applying this function per key: it should work on the 3 keys separately, because each key has its own distribution. For example, for the key `US4ZNB` the function should work on its 5 values, and similarly for every other key. – Matan Retzer Jun 28 '18 at 05:19

2 Answers

ddply from the plyr package does exactly this. It applies a function to each subset of the data, where the subsets are defined by the values of a column.

    plyr::ddply(df, "key", plyr::numcolwise(smooth_outliers))

The first argument is your data with "key" and "total", the second argument is the grouping variable, in this case "key".

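The third argument is the function you want to apply to each subset. ddply passes each subset to it as a data frame, while smooth_outliers expects a numeric vector, so numcolwise is used to wrap it into a function that runs on every numeric column of the data frame it receives (here, only total). In other words, it turns the vector-based smooth_outliers function into a column-based one.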

Then voila.

You'll get a data frame that lists each key alongside its smoothed total values, as returned by the smooth_outliers function.

Here's the result.

      key total
1  JM4YYM  1421
2  JM4YYM  1712
3  JM4YYM  1709
4  US4YYM  1114
5  US4YYM  1473
6  US4YYM  1181
7  US4YYM  1767
8  US4YYM  1005
9  US4ZAW  1138
10 US4ZAW  1156
11 US4ZAW  1982
12 US4ZNB  1338
13 US4ZNB  1075
14 US4ZNB  1806

As you can see, each row keeps its key, and total holds the value that smooth_outliers returned for that row within its key's group.
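For anyone unsure what ddply is doing under the hood, the same split-apply-combine idea can be sketched in base R. This is only an illustration, not plyr's actual implementation: the data frame is rebuilt from the question as currently posted, and the names `pieces`, `smoothed` and `result` are made up for the example.

```r
smooth_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.3 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- round(qnt[1] - H)
  y[x > (qnt[2] + H)] <- round(qnt[2] + H)
  y
}

# Data from the question
df <- data.frame(
  key = rep(c("US4ZNB", "US4YYM", "JM4YYM"), each = 5),
  total = c(10, 1075, 10000, 1138, 1156,
            1114, 1072, 50, 1181, 8000,
            15000, 2000, 100, 2200, 2300)
)

# 1. Split: one data frame per key
pieces <- split(df, df$key)

# 2. Apply: smooth the total column inside each piece
smoothed <- lapply(pieces, function(d) {
  d$total <- smooth_outliers(d$total)
  d
})

# 3. Combine: stack the pieces back into one data frame (ordered by key)
result <- do.call(rbind, smoothed)
rownames(result) <- NULL
result
```

Each key is smoothed against its own quartiles, so for example the 10 in US4ZNB becomes 970 and the 15000 in JM4YYM becomes 2690.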

LachlanO
  • Thanks for your question. Maybe I didn't understand, but I still didn't get what I needed. For example, how can I get from: ` key total 1 US4ZNB 10-->1000+- 2 US4ZNB 1075 3 US4ZNB 1806`? I mean, what exactly do you mean by "make the row-based smooth-outliers function a column based function"? – Matan Retzer Jun 27 '18 at 07:39

After working through these ideas, I managed to find a solution to my issue. I just used dplyr::group_by:

    library(dplyr)
    df.new <- df %>% group_by(key) %>% mutate(val = smooth_outliers(total))
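As a quick cross-check that does not need dplyr, base R's ave applies a function within groups and puts the results back in the original row order, which should give the same val column as the pipeline above. This is a sketch using the data from the question; building df with rep() is just a shortcut for the example.

```r
smooth_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.3 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- round(qnt[1] - H)
  y[x > (qnt[2] + H)] <- round(qnt[2] + H)
  y
}

# Data from the question
df <- data.frame(
  key = rep(c("US4ZNB", "US4YYM", "JM4YYM"), each = 5),
  total = c(10, 1075, 10000, 1138, 1156,
            1114, 1072, 50, 1181, 8000,
            15000, 2000, 100, 2200, 2300)
)

# Apply smooth_outliers to total within each key, keeping row order
df$val <- ave(df$total, df$key, FUN = smooth_outliers)
df$val
# 970 1075 1261 1138 1156 1114 1072 930 1181 1323 2690 2000 1610 2200 2300
```

Only the two extreme values in each key change; the remaining values are inside that key's fences and are left alone.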

Thank you all.

Matan Retzer