
I have a large dataset and have defined outliers to be those values which fall either above the 99th or below the 1st percentile.

I'd like to take the mean of those outliers with their previous and following datapoints, then replace all 3 values with that average in a new dataset.

If there's anyone who knows how to do this I'd be very grateful for a response.

Brian Tompsett - 汤莱恩
Bobbo

1 Answer


If you have a vector of indices giving the locations of the outliers in your data, e.g. from:

out_idx = which(vec > quan0.99)
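
The thresholds `quan0.99` and `quan0.01` are not defined in the snippet above. One way to obtain them, assuming the data is a plain numeric vector called `vec`, is base R's `quantile()`. The data below is made up purely for illustration:

```r
# Hypothetical example data: 100 ordinary values plus two planted extremes
set.seed(1)
vec = c(rnorm(50), 25, rnorm(50), -25)

# 1st and 99th percentiles; names = FALSE drops the "1%" / "99%" labels
quan0.01 = quantile(vec, 0.01, names = FALSE)
quan0.99 = quantile(vec, 0.99, names = FALSE)

# positions of values above the 99th or below the 1st percentile
out_idx = which(vec > quan0.99 | vec < quan0.01)
```

Here both tails are collected in a single index vector; keeping them separate, as in the rest of this answer, works just as well.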

You can do something like:

for(idx in out_idx) {
  vec[(idx-1):(idx+1)] = mean(vec[(idx-1):(idx+1)])
}
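
To see what the loop does, here is a small worked example on made-up values with a single outlier at position 3. It assumes the outlier is not the first or last element, since the window `(idx-1):(idx+1)` would otherwise reach outside the vector:

```r
# Made-up data with a single outlier (30) at position 3
vec = c(1, 2, 30, 4, 5)
out_idx = 3

for (idx in out_idx) {
  # replace the outlier and both neighbours with their common mean
  vec[(idx - 1):(idx + 1)] = mean(vec[(idx - 1):(idx + 1)])
}

# vec is now c(1, 12, 12, 12, 5), because mean(c(2, 30, 4)) is 12
```

Note that the vector is modified in place, so if two outliers sit close together their windows overlap and the second mean is computed from values that were already averaged.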

You can wrap this in a function, making the bandwidth and the summary function optional parameters:

average_outliers = function(vec, outlier_idx, bandwith = 1, func = "mean") {
  # iterate over the outlier positions
  for (idx in outlier_idx) {
    # clamp the window so it never runs past either end of the vector
    window = max(1, idx - bandwith):min(length(vec), idx + bandwith)
    # slices can be used for extracting values or, as in this case, for
    # assigning values to that slice. do.call is used to call e.g. the mean
    # function with the values in the window as input.
    vec[window] = do.call(func, list(vec[window]))
  }
  return(vec)
}
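
As a brief illustration of the `do.call` step: it takes a function (or its name as a string) plus a list of arguments, so the values have to be wrapped in `list()`. The `window` values below are made up:

```r
window = c(2, 30, 4)   # made-up values around an outlier

# do.call takes the function (or its name) and a *list* of arguments
do.call("mean", list(window))     # same as mean(window), i.e. 12
do.call("median", list(window))   # same as median(window), i.e. 4
do.call(median, list(window))     # passing the function object also works
```

Passing the bare vector instead of a list makes `do.call` fail, because its second argument must be a list.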

This also allows you to use, for example, the median with a bandwidth of 2. Using this function:

# Call average_outliers twice on the same vector,
# first for the 0.99 quantile, then for the 0.01 quantile.
vec = average_outliers(vec, which(vec > quan0.99))
vec = average_outliers(vec, which(vec < quan0.01))

or:

vec = average_outliers(vec, which(vec > quan0.99), bandwith = 2, func = "median")
vec = average_outliers(vec, which(vec < quan0.01), bandwith = 2, func = "median")

to use a bandwidth of 2 and replace the values with the median.

Paul Hiemstra
  • `vec` is the original dataset. In the call `vec = average_outliers(vec, which(vec > quan0.99))`, `vec` is both input and output: on the rhs it is the original dataset, on the lhs it is the new dataset. You could also give the return value of `average_outliers` a new name, i.e. put it in a new variable: `vec2 = average_outliers(vec, which(vec > quan0.99))`. – Paul Hiemstra Nov 12 '12 at 09:15
  • I added some comments that might shed more light on my thought process. – Paul Hiemstra Nov 12 '12 at 09:18
  • @RobS no problem, a good question leads to a good answer. Next time, it would be even better if you included a reproducible example, in your case a time series in which you want to replace the outliers. – Paul Hiemstra Nov 12 '12 at 09:45