Methods to remove outliers from data using R

Question

I have to remove outliers from the modeling data. I am tired of trying every removal method, because there is one outlier that is still troubling me after applying many of them.

Can anyone please help me with this?

I have tried winsorizing and the outliers and extremevalues packages, among others, yet could not remove the outliers.

The data has 50000 customers and 32 attributes.

The data has both numeric and non-numeric variables.

I am not able to attach the data set here.

Please help me.

Extra information:

I am more than worried, since this is my dissertation and I have no idea how to deal with outliers.

If you know anything that works, please post it.

The data is available on the net, but I cannot post it here, sorry.

My supervisor needs a plot with no outliers, and also the complete data records for the outlying observations. I don't know how to do this for all the combinations of variables: pick out the outliers and then plot without any outliers in the graph.

I have no idea how to do it. I can't post pictures or snapshots of the data since my reputation is <10.

How do you see that not all outliers have been removed? The method you use to check if they are still there can probably also be used to remove them. – Vincent Zoonekynd, Jun 15 '13 at 16:36
@Vincent Zoonekynd - I used the outliers and extremevalues packages, but the outliers were the same as they were before using them. – Pavithra C Reddy, Jun 16 '13 at 10:12
@Michele - Thank you so much. I am not able to post pictures, data, or snapshots at the moment. Please try something and help me. – Pavithra C Reddy, Jun 16 '13 at 10:27

score 1 · Accepted Answer · answered Jun 16 '13 at 11:51

Without more information about your data and your results so far, you will only get very general answers. For instance, there is a chapter on outlier detection in Y. Zhao's R and Data Mining that may be useful.

If your dataset is this one, most of the variables are qualitative: it may be sufficient to look at each variable separately, and consider rare classes as outliers. A few more algorithms are listed in this article.

It could also be that there are no outliers to worry about.
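
As a rough illustration of the "rare classes" idea for qualitative variables, the sketch below flags rows whose category occurs in less than a chosen share of the data. The data frame `d`, the column `job`, and the 1% cutoff are made up for the example; they are not taken from the actual dataset, and the right cutoff is a judgment call.

```r
# Toy data: a hypothetical qualitative column with one very rare level.
set.seed(1)
d <- data.frame(
  job = c(sample(c("clerk", "teacher", "driver"), 997, replace = TRUE),
          rep("astronaut", 3)),
  stringsAsFactors = FALSE
)

# Relative frequency of each level.
freq <- prop.table(table(d$job))

# Levels seen in fewer than 1% of rows (the cutoff is an arbitrary assumption).
rare_levels <- names(freq)[as.vector(freq) < 0.01]

# Flag the rows that belong to a rare level and split the data.
is_rare  <- d$job %in% rare_levels
outliers <- d[is_rare, , drop = FALSE]   # rows to study as special cases
filtered <- d[!is_rare, , drop = FALSE]  # rows kept for modeling
```

Repeating this for each qualitative column (for example with `lapply` over the column names) would give one set of candidate rows per variable.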

Vincent Zoonekynd

Thanks Vincent, that's the data. I am not able to post my work here due to low reputation. The issue is that I need to draw a scatter plot; I used `plot(x$V5,x$V28)`, which are the age and personal income variables. I found 16 outliers in this plot, and I need to remove them to get an outlier-free plot. I also need to extract all the outliers in the data set and analyse them further as special cases. I don't know how to do that. I hope you understand. – Pavithra C Reddy Jun 16 '13 at 12:36
These are univariate outliers: all salaries are below 50,000, except those 16, which are above 500,000. You can look at the data, one variable at a time (`hist(x$V28)`, `boxplot(x$V28)`, `tail(sort(x$V28),20)`, etc.) and decide which observations to discard. – Vincent Zoonekynd Jun 16 '13 at 12:52
Vincent, how do I discard the observations? I know I have outliers, but I don't know how to discard them. That's the problem. I am trying the outliers and extremevalues packages, but they don't work. Please help me with this if you can. – Pavithra C Reddy Jun 16 '13 at 17:47
Do you mean manually, or is there another way I can do that? Please do let me know. – Pavithra C Reddy Jun 16 '13 at 18:06
Yes, manually: `i <- d$PERSONAL_NET_INCOME < 50000; filtered_dataset <- d[i,]; outliers <- d[!i,]`. – Vincent Zoonekynd Jun 16 '13 at 18:27
Thanks a lot! Vincent, one more question please: I still have a single outlier when I use plot, so how do I analyse the correlation between age and net income? – Pavithra C Reddy Jun 16 '13 at 21:50
In the data I have, there are no outliers left, but you can always change the threshold. However, computing the correlation will not be very meaningful, for two reasons: first, the distribution of income is highly skewed (this can be remedied by taking the logarithm of the income), second, it has many zeroes, that should probably be treated separately. The rest of the data shows a positive correlation: `library(hexbin); plot( hexbin(d$AGE[i], log(1+d$PERSONAL_NET_INCOME[i])) )`. – Vincent Zoonekynd Jun 16 '13 at 22:50
Vincent, do you mind telling me how to draw separate barplots of net income (in PAKDD2009, `d$V28` in R) for good and bad customers, using the target variable Good_bad (in PAKDD2009, `d$V32` in R)? Please let me know how to do this. – Pavithra C Reddy Jun 17 '13 at 13:21
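
The manual threshold split and the log transform suggested in the comments above can be sketched end to end as follows. The data here is simulated (the real file is not reproduced), the column names simply mirror the ones used in the comments, and the 50000 threshold is the one chosen by eye in the thread, so treat this as an illustration rather than a recipe.

```r
# Simulated stand-in for the age and income columns.
set.seed(42)
n <- 1000
d <- data.frame(
  AGE = sample(18:80, n, replace = TRUE),
  PERSONAL_NET_INCOME = round(rlnorm(n, meanlog = 6, sdlog = 1))
)
d$PERSONAL_NET_INCOME[1:5]  <- 600000  # a few extreme values
d$PERSONAL_NET_INCOME[6:55] <- 0       # a block of zero incomes

# Split on a manually chosen threshold.
i <- d$PERSONAL_NET_INCOME < 50000
filtered_dataset <- d[i, ]   # used for plotting and modeling
outliers <- d[!i, ]          # full rows, kept for separate analysis

# Scatter plot without the extreme values; log(1 + x) tames the skew.
plot(filtered_dataset$AGE, log1p(filtered_dataset$PERSONAL_NET_INCOME),
     xlab = "AGE", ylab = "log(1 + PERSONAL_NET_INCOME)")

# Correlation on the log scale, with zero incomes set aside.
nonzero <- filtered_dataset$PERSONAL_NET_INCOME > 0
r <- cor(filtered_dataset$AGE[nonzero],
         log(filtered_dataset$PERSONAL_NET_INCOME[nonzero]))
```

If a point still looks isolated in the plot, lowering the threshold and re-running the split is the simplest adjustment.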

score 0 · Answer 2 · answered Jun 15 '13 at 16:22

Your data is multivariate, so you can use cov.mcd and cov.mve to obtain the minimum covariance determinant and minimum volume ellipsoid estimators. Then calculate Mahalanobis distances using one of these covariance estimates. Observations whose squared Mahalanobis distance exceeds a critical value can be labeled as outliers. For the critical value, use a quantile of the chi-square distribution with p degrees of freedom, where p is the number of variables.

Edit: cov.mcd and cov.mve are defined in package MASS
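
A minimal sketch of this procedure, assuming a purely numeric matrix (the qualitative columns would have to be handled separately). The data is simulated with a few planted outliers, and the 0.975 quantile is a common but arbitrary choice, not something fixed by the method.

```r
library(MASS)  # provides cov.mcd, cov.mve and mvrnorm

# Toy numeric data: 200 regular points plus 5 planted outliers.
set.seed(123)
regular <- mvrnorm(200, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))
planted <- cbind(10 + rnorm(5), -10 + rnorm(5))
x <- rbind(regular, planted)

# Robust location and scatter from the minimum covariance determinant.
fit <- cov.mcd(x)

# mahalanobis() returns squared distances already.
d2 <- mahalanobis(x, center = fit$center, cov = fit$cov)

# Critical value: chi-square quantile with p degrees of freedom.
p <- ncol(x)
cutoff <- qchisq(0.975, df = p)

is_outlier <- d2 > cutoff
outliers <- x[is_outlier, , drop = FALSE]   # rows to inspect separately
filtered <- x[!is_outlier, , drop = FALSE]  # rows below the cutoff
```

Swapping `cov.mcd(x)` for `cov.mve(x)` gives the minimum volume ellipsoid version. With a 0.975 quantile, a small share of ordinary points will also exceed the cutoff, so flagged rows are candidates to inspect rather than proven errors.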

jbytecode

Could you please be more detailed? I need to do data exploration, deal with outliers, and build classification models. I am new to R and data mining. Unfortunately my dissertation is on this data, and I get no help from my tutor either. Sorry for the trouble. Please help me. – Pavithra C Reddy Jun 15 '13 at 16:55
