9

I'd like to replace every value in my relatively large R dataset that lies above the 95th percentile or below the 5th percentile with the corresponding percentile value. My aim is to avoid simply cropping these outliers from the data entirely.

Any advice would be much appreciated; I can't find any information on how to do this anywhere else.

Brian Tompsett - 汤莱恩
Bobbo
  • Besides there being many more details required to answer this question, are you really sure you want to do this? A relatively large data set of, say, 100 numbers will have 5 values below the 5th percentile and 5 above the 95th percentile even if there are no outliers. – John Nov 12 '12 at 07:25
  • Take great care with this kind of measure: you are drastically changing the statistics of your dataset. Whether this is valid depends on what you are trying to get from the data and on the distribution of the data (e.g. normally distributed). – Paul Hiemstra Nov 12 '12 at 07:37
  • @RobS be careful with using `=` as an assignment operator. The `<-` can be compounded, but `=` can **not** – Ricardo Saporta Nov 12 '12 at 08:11
  • I almost always use `=`, and I've rarely run into trouble. Only in calls like `system.time(bla <- spam())` is the `<-` compulsory. – Paul Hiemstra Nov 12 '12 at 08:49
  • Bobbo, the missing details would include what the model is and how you're defining your percentiles; whether you wanted empirical cutoffs derived from the data or cutoffs derived from a model and what that model is; and specifically how you wanted the data points replaced... replace with random values using the model parameters?... some other form of imputation? tack back onto the end? Additionally, what you're doing doesn't test robustness by itself. It would require adding something else. – John Nov 12 '12 at 14:56
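
John's point about the 5 + 5 values can be checked directly. This is only a sketch: the seed and the simulated normal data are arbitrary choices for illustration, not the asker's data, and it relies on quantile()'s default settings.

```r
# Simulated data with no outliers added; the seed is arbitrary.
set.seed(1)
x <- rnorm(100)

# Empirical 5th and 95th percentiles (default quantile type)
q <- quantile(x, c(.05, .95))

n_below <- sum(x < q[1])  # values that would be raised to the lower cutoff
n_above <- sum(x > q[2])  # values that would be lowered to the upper cutoff
n_below
n_above
```

With 100 distinct values, both counts come out as 5, so capping at these percentiles always changes about 10% of the data, whether or not any of it is anomalous.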

4 Answers

21

This would do it.

fun <- function(x){
    quantiles <- quantile( x, c(.05, .95 ) )
    x[ x < quantiles[1] ] <- quantiles[1]
    x[ x > quantiles[2] ] <- quantiles[2]
    x
}
fun( yourdata )
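
If the data may contain missing values, note that quantile() stops with an error on NA input unless na.rm = TRUE is passed. Below is a minimal NA-tolerant sketch of the same idea; the name cap_quantiles and the probs argument are made up for illustration and are not part of the original answer.

```r
# Hypothetical variant of the function above that tolerates NA values.
cap_quantiles <- function(x, probs = c(.05, .95)) {
  # na.rm = TRUE computes the cutoffs from the non-missing values only
  q <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  # which() returns only positions where the comparison is TRUE,
  # so NA entries are never selected and stay NA
  x[which(x < q[1])] <- q[1]
  x[which(x > q[2])] <- q[2]
  x
}

capped <- cap_quantiles(c(1:20, NA))
capped
```

The last element of the result is still NA, while the smallest and largest values are pulled in to the two cutoffs.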
Romain Francois
  • Thank you, works like a dream. I'm new to this website, is there any way I can give you rep or something for this answer? – Bobbo Nov 12 '12 at 07:45
  • You can upvote the answer(s) and accept one (you accepted it already). See http://stackoverflow.com/faq which will also give you a badge if you read them all – Romain Francois Nov 12 '12 at 07:56
  • The above snippet will also replace NAs (if any) by the quantile values! – Bolaka Nov 18 '14 at 12:13
  • check the .clip function from pandas https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.quantile.html as well – Jia Gao Nov 21 '21 at 13:45
12

You can do it in one line of code using squish():

d2 <- squish(d, quantile(d, c(.05, .95)))

squish() lives in the scales package; see ?squish and ?discard.

#--------------------------------
library(scales)

pr <- .95
q  <- quantile(d, c(1-pr, pr))
d2 <- squish(d, q)
#---------------------------------

# Note: depending on your needs, you may want to round off the quantiles, i.e.:
q <- round(quantile(d, c(1-pr, pr)))

Example:

d <- 1:20
d
# [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20


d2 <- squish(d, round(quantile(d, c(.05, .95))))
d2
# [1]  2  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 19
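
To see what the rounding step changes, here is a sketch of the same clamp written in base R so that it runs without scales. The helper name clamp is made up for illustration; it is meant to mirror what squish() does for finite values, not to reproduce its full behaviour.

```r
# Hypothetical base-R stand-in for squish(): limit x to [range[1], range[2]].
clamp <- function(x, range) pmin(range[2], pmax(range[1], x))

d <- 1:20
q <- quantile(d, c(.05, .95), names = FALSE)  # unrounded cutoffs

exact   <- clamp(d, q)         # ends are replaced by the fractional cutoffs
rounded <- clamp(d, round(q))  # ends are replaced by whole numbers

exact[c(1, 2, 19, 20)]
rounded[c(1, 2, 19, 20)]
```

Without rounding, the first and last elements become the fractional cutoffs (1.95 and 19.05 here); with rounding they become 2 and 19, which keeps integer-valued data integer-valued.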
Ricardo Saporta
  • Nice. Or you could roll squish into your own function. `cap <- function(x, low, high) pmin(high, pmax(low, x))` – Ben Jan 28 '20 at 15:03
3

I used this code to get what you need:

qn = quantile(df$value, c(0.05, 0.95), na.rm = TRUE)
df = within(df, { value = ifelse(value < qn[1], qn[1], value)
                  value = ifelse(value > qn[2], qn[2], value)})

where df is your data.frame, and value the column that contains your data.

Paul Hiemstra
2

There is a better way to solve this problem. An outlier is not simply any point above the 95th percentile or below the 5th percentile. Instead, a point is considered an outlier if it is below the first quartile - 1.5·IQR or above the third quartile + 1.5·IQR.
This website explains it more thoroughly.

To learn more about outlier treatment, refer here.

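capOutlier <- function(x){
   # Quartiles define the fences; the 5th/95th percentiles are the replacement values
   qnt <- quantile(x, probs=c(.25, .75), na.rm = TRUE)
   caps <- quantile(x, probs=c(.05, .95), na.rm = TRUE)
   H <- 1.5 * IQR(x, na.rm = TRUE)
   # Only points outside the fences are changed
   x[x < (qnt[1] - H)] <- caps[1]
   x[x > (qnt[2] + H)] <- caps[2]
   return(x)
}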
df$colName=capOutlier(df$colName)
Repeat the line above for each numeric column in your data frame.
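
Rather than repeating that line by hand, the function can be applied to every numeric column in one step. This is a sketch under assumptions: the data frame and its column names are invented for illustration, and capOutlier is repeated here only so the snippet runs on its own.

```r
# Same function as above, repeated so this snippet is self-contained.
capOutlier <- function(x) {
  qnt  <- quantile(x, probs = c(.25, .75), na.rm = TRUE)
  caps <- quantile(x, probs = c(.05, .95), na.rm = TRUE)
  H <- 1.5 * IQR(x, na.rm = TRUE)
  x[x < (qnt[1] - H)] <- caps[1]
  x[x > (qnt[2] + H)] <- caps[2]
  x
}

# Made-up example data: one extreme value in each numeric column.
df <- data.frame(a = c(1:19, 100),
                 b = c(-50, 2:20),
                 label = letters[1:20])

# Pick out the numeric columns and cap each of them; other columns are untouched.
num_cols <- vapply(df, is.numeric, logical(1))
df[num_cols] <- lapply(df[num_cols], capOutlier)

df$a[20]  # the 100 is replaced by the column's 95th percentile
df$b[1]   # the -50 is replaced by the column's 5th percentile
```

Selecting columns with is.numeric keeps character or factor columns such as label out of the calculation, where quantile() would not make sense.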
Prateek Sharma
  • That is a rigid definition of an outlier. Whether you define the outlier definition at below 20% / above 80%+ (as you have defined) or below 5% / above 95%+ (as the OP) is arbitrary; what works will depend on your problem and data. – ctbrown Jan 19 '19 at 20:48
  • I didn't define it as below 20% or above 80%. I used a common definition of an outlier that will probably be used in an introduction to statistics class. Anything less than the first quartile - 1.5 * the interquartile range or above the third quartile + 1.5 * the interquartile range is considered an outlier. The interquartile range (IQR) is the range between the first quartile and the third quartile (the middle 50% of the data). – Kyle Peters Jan 21 '19 at 20:48
  • That is not a "common" definition of what an outlier is. It is an **arbitrary** one. – ctbrown Jan 23 '19 at 12:42
  • If you take a 101 statistics class in college, they will give you this definition of what an outlier is. Check the website in my answer. There are other definitions of what an outlier is, but this is the most basic and most used one. And, the definition I posted is more accurate than the one given in the question. If you had the data (.99998,1,1,1,1,1,1,1,1.0001), then .99998 and 1.0001 would be classified wrongly as outliers if you used the outlier classification method described in the question. – Kyle Peters Feb 04 '19 at 18:44