0

I have a data frame

df=data.frame(a=1:5,b=c(1,3,4,-999999,4))

and want to calculate the average of it, but want to exlcude the outlier, -9999 in this case

Is there a command similar to

na.rm

so I can go

apply(df,1,function(x) mean(if x>-100 & x <100))

note that my df if 600 x 50 dim

frank
  • 3,036
  • 7
  • 33
  • 65
  • 1
    `mean` has a trim argument that will remove some percentage of the extreme observations. This may be helpful. – lmo Jun 17 '16 at 11:18
  • that is a nice idea, but it only removes values from the end: observations to be trimmed from each end of x before the mean is computed, and my df has many outliers in many possible positions – frank Jun 17 '16 at 11:21
  • When the help file says "each end of x," it means each end of `sort(x)`. So, for example: `df=data.frame(a=1:5,b=c(1,3,4,-999999,4), c=16:20); apply(df,1,function(x) mean(x, trim=.7))` will trim the max and min and return the median. – lmo Jun 17 '16 at 11:25
  • Are you looking for the mean of the whole data frame or row- or column-wise means? If the whole data frame, you can just do `mean(df[df > -100 & df < 100])`, which returns 3. – ulfelder Jun 17 '16 at 11:26
  • 1
    @alex I posted a solution below. Have you tried that? – akrun Jun 17 '16 at 11:31
  • @ ulfelder I will eventually seek the mean of the whole df, but need to exclude all outliers, so planned to take the mean of the apply fcn – frank Jun 17 '16 at 11:31
  • Your code and your expected output is not clear. – akrun Jun 17 '16 at 11:45

3 Answers3

2

We can replace the values in 'df' based on the logical condition to NA and use rowMeans.

rowMeans(replace(df, !(df < 100 & df > -100) , NA), na.rm=TRUE)
#[1] 1.0 2.5 3.5 4.0 4.5

Or as @RHertel mentioned in the comments

rowMeans(replace(df, abs(df) > 100 , NA), na.rm=TRUE) 
#[1] 1.0 2.5 3.5 4.0 4.5

Or another option would be to loop through the columns of 'df', get the outlier observations with boxplot.stats, convert the outliers to NA and do the rowMeans.

rowMeans(sapply(df, function(x)  x*NA^(x %in% boxplot.stats(x)$out)), na.rm=TRUE)
#[1] 1.0 2.5 3.5 4.0 4.5
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    I like the `boxplot.stats()` option. Good way to detect outliers; much better than a manual threshold. – RHertel Jun 17 '16 at 11:32
  • > rowMeans(replace(df, df < 100 & df > -100, NA), na.rm=TRUE) [1] NaN NaN NaN -999999 NaN Not what I am looking for – frank Jun 17 '16 at 11:34
  • @alex Based on the example showed, it works for me. In your original data, there may not be any observation in a row that meets that the condition, so NaN. For that you cannot blame the solution. – akrun Jun 17 '16 at 11:35
  • odd indeed, it did not work right now with the df given – frank Jun 17 '16 at 11:36
  • @alex Sorry, the condition should been negated in the first one. Fixed – akrun Jun 17 '16 at 11:40
2

If you want the mean of the whole data frame, you can cut to the chase and use:

mean(df[df > -100 & df < 100])

When I apply that to the df you gave, I get the expected answer: 3.

ulfelder
  • 5,305
  • 1
  • 22
  • 40
0

You write your own function excluding -9999 value for your case

y<-function(x){mean(x[!grepl("\\-999",x)])}

y(df$b)

[1] 3
Arun kumar mahesh
  • 2,289
  • 2
  • 14
  • 22