na.rm for other values in R

Question

I have a data frame

df=data.frame(a=1:5,b=c(1,3,4,-999999,4))

and want to calculate the average of it, but want to exlcude the outlier, -9999 in this case

Is there a command similar to

na.rm

so I can go

apply(df,1,function(x) mean(if x>-100 & x <100))

note that my df if 600 x 50 dim

`mean` has a trim argument that will remove some percentage of the extreme observations. This may be helpful. — lmo, Jun 17 '16 at 11:18
that is a nice idea, but it only removes values from the end: observations to be trimmed from each end of x before the mean is computed, and my df has many outliers in many possible positions — frank, Jun 17 '16 at 11:21
When the help file says "each end of x," it means each end of `sort(x)`. So, for example: `df=data.frame(a=1:5,b=c(1,3,4,-999999,4), c=16:20); apply(df,1,function(x) mean(x, trim=.7))` will trim the max and min and return the median. — lmo, Jun 17 '16 at 11:25
Are you looking for the mean of the whole data frame or row- or column-wise means? If the whole data frame, you can just do `mean(df[df > -100 & df < 100])`, which returns 3. — ulfelder, Jun 17 '16 at 11:26
@ ulfelder I will eventually seek the mean of the whole df, but need to exclude all outliers, so planned to take the mean of the apply fcn — frank, Jun 17 '16 at 11:31

akrun · Answer 1 · 2016-06-17T11:41:18.703

2

We can replace the values in 'df' based on the logical condition to NA and use rowMeans.

rowMeans(replace(df, !(df < 100 & df > -100) , NA), na.rm=TRUE)
#[1] 1.0 2.5 3.5 4.0 4.5

Or as @RHertel mentioned in the comments

rowMeans(replace(df, abs(df) > 100 , NA), na.rm=TRUE) 
#[1] 1.0 2.5 3.5 4.0 4.5

Or another option would be to loop through the columns of 'df', get the outlier observations with boxplot.stats, convert the outliers to NA and do the rowMeans.

rowMeans(sapply(df, function(x)  x*NA^(x %in% boxplot.stats(x)$out)), na.rm=TRUE)
#[1] 1.0 2.5 3.5 4.0 4.5

edited Jun 17 '16 at 11:41

answered Jun 17 '16 at 11:21

akrun

874,273
37
540
662

1

I like the `boxplot.stats()` option. Good way to detect outliers; much better than a manual threshold. – RHertel Jun 17 '16 at 11:32
> rowMeans(replace(df, df < 100 & df > -100, NA), na.rm=TRUE) [1] NaN NaN NaN -999999 NaN Not what I am looking for – frank Jun 17 '16 at 11:34
@alex Based on the example showed, it works for me. In your original data, there may not be any observation in a row that meets that the condition, so NaN. For that you cannot blame the solution. – akrun Jun 17 '16 at 11:35
odd indeed, it did not work right now with the df given – frank Jun 17 '16 at 11:36
@alex Sorry, the condition should been negated in the first one. Fixed – akrun Jun 17 '16 at 11:40

score 2 · Accepted Answer · answered Jun 17 '16 at 11:37

2

If you want the mean of the whole data frame, you can cut to the chase and use:

mean(df[df > -100 & df < 100])

When I apply that to the df you gave, I get the expected answer: 3.

answered Jun 17 '16 at 11:37

ulfelder

5,305
1
22
40

score 0 · Answer 3 · answered Jun 17 '16 at 11:48

0

You write your own function excluding -9999 value for your case

y<-function(x){mean(x[!grepl("\\-999",x)])}

y(df$b)

[1] 3

answered Jun 17 '16 at 11:48

Arun kumar mahesh

2,289
2
14
22

na.rm for other values in R

3 Answers3