Given a large dataset, how do you remove outliers using IQR method using R

Question

We have been given a large dataset, and we are being asked to remove outliers using the IQR method using R.

The data has 53 columns, 17 of which are continuous, the remaining are categorical. How would you use the IQR method to remove outliers to multiple columns, and update the data frame?

I found this function online, but it does not account for columns that have categorical values.

sample_data <- data.frame(x=c(1, 2, 3, 4, 3, 2, 3, 4, 4, 5, 0),
                           y=c(4, 3, 5, 7, 8, 5, 9, 7, 6, 5, 0),
                           z=c(1, 3, 2, 9, 8, 7, 0, 8, 7, 2, 3))
print("Display original dataframe")
print(sample_data)

detect_outlier <- function(x) {
  
  # calculate first quantile
  Quantile1 <- quantile(x, probs=.25)
  
  # calculate third quantile
  Quantile3 <- quantile(x, probs=.75)
  
  # calculate inter quartile range
  IQR = Quantile3-Quantile1
  
  # return true or false
  x > Quantile3 + (IQR*1.5) | x < Quantile1 - (IQR*1.5)
}

# create remove outlier function
remove_outlier <- function(dataframe,
                           columns=names(dataframe)) {
  
  # for loop to traverse in columns vector
  for (col in columns) {
    
    # remove observation if it satisfies outlier function
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
  
  # return dataframe
  print("Remove outliers")
  print(dataframe)
}

remove_outlier(sample_data, c('x', 'y', 'z', 'w'))




########these are all my variables

'car.deliver.airport.num','car.deliver.hotel.num','car.deliver.train.station.num','car.displayed.turo.review.num',  'car.displayed.turo.review.num.past.12m','car.displayed.turo.review.num.past.18m','car.displayed.turo.review.num.past.6m','car.displayed.user.review.num','car.displayed.user.review.num.past.12m','car.displayed.user.review.num.past.18m','car.displayed.user.review.num.past.6m','car.extra.mile.fee','car.extra.num','car.extra.phone.mount','car.extra.portable.gps','car.extra.post.trip.cleaning','car.extra.prepaid.ev.recharge','car.extra.prepaid.refuel','car.extra.stroller','car.extra.unlimited.mileage','car.faq.num','car.instant.book','car.insurance','car.miles.included','car.photo.num','car.trip.price','host.car.num'



Original output
#Output after
#   x y z
#1  1 4 1
#2  2 3 3
#3  3 5 2
#4  4 7 9
#5  3 8 8
#6  2 5 7
#7  3 9 0
#8  4 7 8
#9  4 6 7
#10 5 5 2
#11 0 0 3

#Output after
#   x y z
#1  1 4 1
#2  2 3 3
#3  3 5 2
#4  4 7 9
#5  3 8 8
#6  2 5 7
#7  3 9 0
#8  4 7 8
#9  4 6 7
#10 5 5 2

I'm expecting for outliers to be removed from the original dataframe, for only continuous variables.

@akrun just added an example.. the example works, but my actual dataset is way bigger. i included all the variables. some are continuous, some are categorical. — dataanalyst, Dec 07 '22 at 17:27
When you loop, before doing anything else, check that col is numeric ... — kjetil b halvorsen, Dec 07 '22 at 17:31

akrun · Answer 1 · 2022-12-07T17:36:04.830

0

We may do this easily in tidyverse i.e. loop across the columns that are numeric (where(is.numeric)) and replace the values that are outliers by applying the detect_outlier to NA). Removing the values results in length difference across columns and then it can be saved only as a list as data.frame/tibble requires all columns to be of same length

library(dplyr)
sample_data %>% 
   mutate(across(where(is.numeric), ~ replace(.x, detect_outlier(.x), NA)))

edited Dec 07 '22 at 17:36

answered Dec 07 '22 at 17:34

akrun

874,273
37
540
662

thank you! but we are actually supposed to use the IQR method per prof directions – dataanalyst Dec 07 '22 at 17:35
@dataanalyst if the column name is `prof`, then `sample_data %>% group_by(prof) %>% mutate(across(where(is.numeric), ~ replace(.x, detect_outlier(.x), NA))) %>% ungroup` – akrun Dec 07 '22 at 17:36

score 0 · Answer 2 · answered Dec 07 '22 at 17:45

0

First select only the numeric columns, here a simple example:

DF <- data.frame(x=rnorm(10),y=sample(1:100,10),
                 z=factor(sample(LETTERS[1:2],10,replace=TRUE)))

select <- sapply(DF, is.numeric, simplify=TRUE) 

DF2 <- DF[, select, drop=FALSE]

Then applying the function removing outliers ...

answered Dec 07 '22 at 17:45

kjetil b halvorsen

1,206
2
18
28

Ohh Thank you! this works, however, some of the categories are numerical, like 0 = True, 1 = False. What if I referenced an excel doc that had the variables differentiated, how can i make that adjustment to your code? – dataanalyst Dec 07 '22 at 17:52
I'm not sure I understand ... you mean some numerical cols are really representing boolean? – kjetil b halvorsen Dec 07 '22 at 17:54
yes, exactly, some numerical columns are representing a boolean – dataanalyst Dec 07 '22 at 17:56
That a numerical column **represents** non-numerical values cannot be detected programmatically! So just add a variable , say repno, that contains FALSE for this columns and do something like `select <- select & repno` – kjetil b halvorsen Dec 07 '22 at 18:00

Given a large dataset, how do you remove outliers using IQR method using R

2 Answers2