Removing outliers easily in R

Question

I have data with discrete x-values, such as

x <- c(3,8,13,8,13,3,3,8,13,8,3,8,8,13,8,13,8,3,3,8,13,8,13,3,3)
y <- c(4,5,4,6,7,20,1,4,6,2,6,8,2,6,7,3,2,5,7,3,2,5,7,3,2)

How can I generate a new dataset of x and y values in which I drop every pair whose y-value is more than 2 standard deviations above the mean for its bin? For example, in the x=3 bin, 20 is more than 2 SDs above the mean, so that data point should be removed.

agstudy · Accepted Answer · 2013-03-01T17:06:47.990

7

I think you want something like this:

 dat <- data.frame(x, y)  # data frame used throughout this answer
 by(dat,dat$x, function(z) z$y[z$y < 2*sd(z$y)])
dat$x: 3
[1] 4 1 6 5 7 3 2
--------------------------------------------------------------------------------------------------------------- 
dat$x: 8
[1] 4 2 2 2 3
--------------------------------------------------------------------------------------------------------------- 
dat$x: 13
[1] 3 2

EDIT after comment:

 by(dat, dat$x,
    function(z) z$y[abs(z$y - mean(z$y)) < 2*sd(z$y)])

EDIT

I slightly changed the by function so that it returns both x and y, then combined the pieces with rbind via do.call:

 do.call(rbind, by(dat, dat$x, function(z) {
   idx <- abs(z$y - mean(z$y)) < 2*sd(z$y)
   z[idx, ]
 }))

Or, using plyr, in a single call:

 library(plyr)
 ddply(dat, .(x), function(z) {
   idx <- abs(z$y - mean(z$y)) < 2*sd(z$y)
   z[idx, ]
 })

edited Mar 01 '13 at 17:06

answered Mar 01 '13 at 15:00

agstudy


1

should that be `z$y < mean(z$y) + 2*sd(z$y)` ? when OP mentioned "the y-value is 2 standard deviations above the mean for that bin" – liuminzhao Mar 01 '13 at 15:02
@liuminzhao I updated my answer. I think my mistake came from how I read the question (I need to improve my English :)) – agstudy Mar 01 '13 at 15:13
After the edit, it gives the same result as James' `tapply`-based solution – QkuCeHBH Mar 01 '13 at 15:56
And then how can I get this back into an x and y structure? (list of x and y values) – CodeGuy Mar 01 '13 at 16:51
@agstudy Does the `plyr` solution remove observations above 2 SD only? Or 2 SD above AND below? – radek Jun 03 '13 at 09:10
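
On the last comment: the filter in the answer wraps the deviation in `abs()`, so it drops points more than 2 SDs from the bin mean in either direction. A minimal sketch of both variants follows, assuming `dat` is built as `data.frame(x, y)` from the question's vectors. The helper name `filter_bins` and its `one_sided` argument are hypothetical, introduced only for illustration.

```r
x <- c(3,8,13,8,13,3,3,8,13,8,3,8,8,13,8,13,8,3,3,8,13,8,13,3,3)
y <- c(4,5,4,6,7,20,1,4,6,2,6,8,2,6,7,3,2,5,7,3,2,5,7,3,2)
dat <- data.frame(x, y)   # assumption: the answer's `dat` is built like this

# Hypothetical helper: filter each x bin by distance from the bin mean.
# Note: a bin with a single row has sd() = NA and is not handled here.
filter_bins <- function(d, one_sided = FALSE) {
  pieces <- lapply(split(d, d$x), function(z) {
    dev <- z$y - mean(z$y)
    if (!one_sided) dev <- abs(dev)   # two-sided: also drop low outliers
    z[dev < 2 * sd(z$y), ]
  })
  do.call(rbind, pieces)
}

two_sided  <- filter_bins(dat)                    # same rule as the answer
upper_only <- filter_bins(dat, one_sided = TRUE)  # only "above the mean"
```

With the question's data both variants drop only the point (3, 20), because no value lies more than 2 SDs below its bin mean.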

score 2 · Answer 2 · answered Mar 01 '13 at 14:50

Something like this?

newdata <- cbind(x,y)[!(y > 2*sd(y)), ]  # logical index: safe even if nothing exceeds the cutoff

Or do you mean something like this?

Data <- cbind(x,y)
Data[-which(sd(y)>rowMeans(Data)), ]

answered Mar 01 '13 at 14:50

Jilber Urbina


This solution does not remove outliers in `y` by bin (*i.e.* separately for each value of `x`), but rather on a global scale – QkuCeHBH Mar 01 '13 at 16:01
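
To make the comment above concrete, the cutoffs can be compared directly. This is a small sketch using the question's vectors; the names `bin_cutoff` and `global_cutoff` are made up for the example, and the rule shown is the "mean plus 2 SDs" rule from the question rather than the `2*sd(y)` expression used in this answer.

```r
x <- c(3,8,13,8,13,3,3,8,13,8,3,8,8,13,8,13,8,3,3,8,13,8,13,3,3)
y <- c(4,5,4,6,7,20,1,4,6,2,6,8,2,6,7,3,2,5,7,3,2,5,7,3,2)

# Upper cutoff (mean + 2 SD) computed separately for each value of x
bin_cutoff <- tapply(y, x, function(z) mean(z) + 2 * sd(z))

# The same cutoff computed once over all y values, ignoring x
global_cutoff <- mean(y) + 2 * sd(y)
```

Here the per-bin cutoffs come out at 18 for x = 3 and 9 for x = 13, against a global cutoff of roughly 12.5. A hypothetical y of 10 in the x = 13 bin would therefore pass the global rule but fail the per-bin one.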

score 2 · Answer 3 · answered Mar 01 '13 at 15:08

You can use tapply for this, but you will lose your original ordering.

tapply(y,x,function(z) z[abs(z-mean(z))<2*sd(z)])
$`3`
[1] 4 1 6 5 7 3 2

$`8`
 [1] 5 6 4 2 8 2 7 2 3 5

$`13`
[1] 4 7 6 6 3 2 7

answered Mar 01 '13 at 15:08

James


Then how can I restructure this into a list of x and y values? – CodeGuy Mar 01 '13 at 16:48
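
One possible way to keep both the x/y structure and the original ordering is base R's `ave()`, which returns a per-bin statistic aligned with the input vector. This is only a sketch using the question's vectors, not part of the answer above; `bin_mean`, `bin_sd`, `keep` and `newdata` are illustrative names.

```r
x <- c(3,8,13,8,13,3,3,8,13,8,3,8,8,13,8,13,8,3,3,8,13,8,13,3,3)
y <- c(4,5,4,6,7,20,1,4,6,2,6,8,2,6,7,3,2,5,7,3,2,5,7,3,2)

# ave() repeats each bin's statistic at every position belonging to that bin
bin_mean <- ave(y, x)            # FUN defaults to mean
bin_sd   <- ave(y, x, FUN = sd)

# Same two-sided rule as the tapply() call, evaluated element by element
keep <- abs(y - bin_mean) < 2 * bin_sd

# Filtered pairs, still in their original order
newdata <- data.frame(x = x[keep], y = y[keep])
```

Because `keep` is a logical vector as long as `x` and `y`, it can also be used to subset any other column that belongs to the same observations.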
