I'm having a problem with a function that I wrote in R (I'm a beginner with this language). The function is meant to remove outliers from a dataset:

removalOutlier <- function(data){
  q1 <- 0
  q2 <- 0
  q3 <- 0

  for(j in 1:(ncol(data)-1)){
    m <- length(data[,j])

    J <- 0
    data <- data[order(data[,j]),]
    data <- as.matrix(data)
    print(data)
    q2 <- median(data[,j])

    if((m %% 2) != 0){
      q1 <- median(data[1:(((m+1)/2)-1),j]) # m odd
      q3 <- median(data[(((m+1)/2)+1):m,j])
    } else {
      q1 <- median(data[1:(m/2),j]) # m even
      q3 <- median(data[(m+2)/2:m,j])
    }

    iqr = q3-q1

    for(k in 1:(length(data[,j]))){
      print(data[k,j])

      if((data[k,j] > (q3+(1.5*iqr))) | (data[k,j] < (q1-(1.5*iqr))))
      {
        J[k] <- k
        data <- as.matrix(data)
      }
      else
      {J[k] <- 0}
    }

    data <- as.data.frame(data)

    for(z in 1:length(J)){
      if(J[z]!=0){
        data <- data.frame(data[-J[z],])
      }
    }
    print (data)

  }
  data
}

This is the output, ordered by the first column of the dataset. Dividing the first column into quartiles, I found Q1-1.5*IQR=117 and Q3+1.5*IQR=3609. Given these bounds, the function should remove the last three rows of the following list:

     V1 V2     V3
45  852  2 179900
32 1000  1 169900
26 1100  3 249900
44 1200  3 299000
47 1203  3 239500
18 1236  3 199900
37 1239  3 229900
15 1268  3 259900
17 1320  2 299900
9  1380  3 212000
4  1416  2 232000
8  1427  3 198999
36 1437  3 249900
27 1458  3 464500
10 1494  3 242500
7  1534  3 314900
2  1600  3 329900
23 1604  3 242900
41 1664  2 368500
21 1767  3 252900
35 1811  4 285900
31 1839  2 349900
46 1852  4 299900
22 1888  2 255000
13 1890  3 329999
11 1940  4 239999
24 1962  4 259900
6  1985  4 299900
12 2000  3 347000
33 2040  4 314900
1  2104  3 399900
38 2132  4 345000
40 2162  4 287000
29 2200  3 475000
42 2238  3 329900
16 2300  4 449900
3  2400  3 369000
28 2526  3 469000
43 2567  4 314000
19 2609  4 499998
30 2637  3 299900
5  3000  4 539900
20 3031  4 599000
34 3137  3 579900
25 3890  3 573900
14 4478  5 699900
39 4515  4 549000

but it removes:

25 3890  3 573900
39 4515  4 549000

and not:

14 4478  5 699900

I don't understand why. When I move on to checking for outliers in the second column, the function finds and removes them without any problem.

Dave2e
Mick
  • At a glance, I'm not sure what's wrong with your code. In general, I'd advise you to look for built-in functions to do things that seem like they might be common tasks rather than coding your own. These functions have been well tested and will be more efficient both in your own time and in the runtime. For example, if you search for "iqr in R", you'll quickly find the `IQR()` function, which could replace many lines of your code. – Gregor Thomas Dec 05 '18 at 17:24
  • Could you make sample data available via the `dput` function? Despite notable [advances in that field](https://stackoverflow.com/questions/13438556/how-do-i-copy-and-paste-data-into-r-from-the-clipboard), it's still a pain to quickly copy/paste data frames from SO to R. – Konrad Dec 05 '18 at 17:32
  • Take a look at this question: same problem, but using the boxplot function to perform the work. https://stackoverflow.com/questions/53201016/remove-outliers-in-r-very-easy/53209680#53209680 – Dave2e Dec 05 '18 at 17:36
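
A possible explanation for the surviving row, offered as a sketch rather than a confirmed diagnosis: the question's last loop deletes flagged rows one at a time with `data[-J[z],]`, so each deletion moves the remaining rows up by one position, and the stored indices then point at the wrong rows. The toy vector below uses made-up values (not the question's dataset) in which positions 5, 6 and 7 play the role of the three flagged rows.

```r
# Toy stand-in for the sorted column (illustrative values only).
x <- c(10, 11, 12, 13, 90, 95, 99)
flagged <- c(5, 6, 7)   # positions flagged as outliers

# Deleting one position at a time, as the question's loop does:
# after each deletion, every later element moves up by one.
one_by_one <- x
for (i in flagged) {
  one_by_one <- one_by_one[-i]   # an out-of-range negative index is silently ignored
}
one_by_one
# [1] 10 11 12 13 95

# Deleting all flagged positions in a single step avoids the shift.
all_at_once <- x[-flagged]
all_at_once
# [1] 10 11 12 13
```

The middle flagged value (95) survives the one-at-a-time deletion, which mirrors the row with 4478 surviving in the question. Note that `x[-integer(0)]` returns an empty vector, so when nothing is flagged a logical mask such as `x[!is_flagged]` is a safer general pattern than negative indices.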

2 Answers

This code removes the last 3 rows according to your IQR logic:

bound <- quantile(df$V1, c(.25, .75)) + c(-1, 1)*1.5*IQR(df$V1)
out <- subset(df, V1 > bound[1] & V1 < bound[2])

Checking the results:

nrow(df) - nrow(out)
# [1] 3

data.table::fsetdiff(df, out)
#      V1 V2     V3
# 1: 3890  3 573900
# 2: 4478  5 699900
# 3: 4515  4 549000
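
The bounds computed by `quantile()` and `IQR()` are not exactly the 117 and 3609 quoted in the question, because R's default quartile definition (type 7) differs from the median-of-halves rule the question implements. The sketch below compares the two on the V1 column. Using `type = 6` to reproduce the question's quartiles is an observation about this particular sample of 47 values, not a claim that the two rules always agree.

```r
# V1 column from the question, already sorted.
v1 <- c(852, 1000, 1100, 1200, 1203, 1236, 1239, 1268, 1320, 1380,
        1416, 1427, 1437, 1458, 1494, 1534, 1600, 1604, 1664, 1767,
        1811, 1839, 1852, 1888, 1890, 1940, 1962, 1985, 2000, 2040,
        2104, 2132, 2162, 2200, 2238, 2300, 2400, 2526, 2567, 2609,
        2637, 3000, 3031, 3137, 3890, 4478, 4515)

# Default definition (type 7), as used by quantile() and IQR() above.
q_default <- quantile(v1, c(.25, .75))
fence_default <- unname(q_default + c(-1, 1) * 1.5 * IQR(v1))

# type = 6 matches the question's median-of-halves quartiles for this sample.
q_type6 <- quantile(v1, c(.25, .75), type = 6)
fence_type6 <- unname(q_type6 + c(-1, 1) * 1.5 * IQR(v1, type = 6))

fence_default  # 176.5 3524.5
fence_type6    # 117.5 3609.5

# Both pairs of fences flag the same three values.
v1[v1 < fence_default[1] | v1 > fence_default[2]]  # 3890 4478 4515
v1[v1 < fence_type6[1] | v1 > fence_type6[2]]      # 3890 4478 4515
```

Either definition leaves 3137 inside the fences, so the choice of quartile rule does not change which rows are dropped for this column.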

Data used:

df <- data.table::fread('
     V1 V2     V3
45  852  2 179900
32 1000  1 169900
26 1100  3 249900
44 1200  3 299000
47 1203  3 239500
18 1236  3 199900
37 1239  3 229900
15 1268  3 259900
17 1320  2 299900
9  1380  3 212000
4  1416  2 232000
8  1427  3 198999
36 1437  3 249900
27 1458  3 464500
10 1494  3 242500
7  1534  3 314900
2  1600  3 329900
23 1604  3 242900
41 1664  2 368500
21 1767  3 252900
35 1811  4 285900
31 1839  2 349900
46 1852  4 299900
22 1888  2 255000
13 1890  3 329999
11 1940  4 239999
24 1962  4 259900
6  1985  4 299900
12 2000  3 347000
33 2040  4 314900
1  2104  3 399900
38 2132  4 345000
40 2162  4 287000
29 2200  3 475000
42 2238  3 329900
16 2300  4 449900
3  2400  3 369000
28 2526  3 469000
43 2567  4 314000
19 2609  4 499998
30 2637  3 299900
5  3000  4 539900
20 3031  4 599000
34 3137  3 579900
25 3890  3 573900
14 4478  5 699900
39 4515  4 549000
')[, -1]
IceCreamToucan

The following function applies the question's outlier criteria, but instead of dropping rows it replaces each outlying value with the column median. Its results are not the same as the OP's results.

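removalOutlier2 <- function(data){
  # Handle one column: x is a numeric vector
  f <- function(x){
    iqr <- IQR(x, na.rm = TRUE)
    qq <- quantile(x, c(1, 3)/4, na.rm = TRUE)
    # Lower and upper fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR
    lims <- qq + c(-1, 1)*1.5*iqr
    out <- which(x < lims[1] | lims[2] < x)
    # Replace outlying values with the column median (rows are kept)
    x[out] <- median(x, na.rm = TRUE)
    x
  }
  # Apply f to every column, keeping the data frame structure
  data[] <- lapply(data, f)
  data
}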

df2 <- removalOutlier2(data)

Edit.

The function's results are consistent with the results in the answer by IceCreamToucan.
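
If dropping whole rows is preferred over replacing values, one possible variant is sketched below. The helper names `is_outlier` and `drop_outlier_rows` and the `toy` data frame are hypothetical, introduced only for illustration. The sketch also flags outliers in every column on the full data at once, whereas the question's loop recomputes the quartiles after each column's removals, so the two approaches can give different results.

```r
# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences.
is_outlier <- function(x) {
  q <- quantile(x, c(.25, .75), na.rm = TRUE)
  fence <- q + c(-1, 1) * 1.5 * IQR(x, na.rm = TRUE)
  !is.na(x) & (x < fence[1] | x > fence[2])
}

# Hypothetical helper: keep only rows with no flagged value in any column.
drop_outlier_rows <- function(data) {
  flags <- vapply(data, is_outlier, logical(nrow(data)))
  data[rowSums(flags) == 0, , drop = FALSE]
}

# Made-up example data: only the value 100 in column a lies outside the fences.
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))
cleaned <- drop_outlier_rows(toy)
nrow(cleaned)  # 4
```

Using a logical mask here, rather than negative row indices, keeps the function correct when no row is flagged.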

Rui Barradas