I have a data frame with 431 variables and 140 observations, and I need to remove outliers. The dataset contains several NA values, and I do not want to drop every row that has an NA. I am trying to remove the outliers with the IQR method. So far, I have been able to obtain the quartiles and IQR of each column with the following:

data <- df2[,4:434]
apply(data,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(data,IQR, na.rm=TRUE) -> iqr

I've also calculated the lower and upper values for each of my columns:

Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,]+1.5*iqr

However, when I tried to replace the outliers with NA, my data frame did not change:

data_no_outlier <- replace(data, data[1:431] < Lower  & data[1:431] > Upper, NA)

I have also run the same script on the iris data, with the same unsuccessful result:

data(iris, package = "datasets")
completeData <- iris[-5]
apply(completeData,2,quantile, probs=c(0.25,0.75), na.rm=TRUE) -> Quartiles
sapply(completeData,IQR, na.rm=TRUE) -> iqr

Lower <- Quartiles[1,]-1.5*iqr
Upper <- Quartiles[2,]+1.5*iqr

data_no_outlier <- replace(completeData, completeData < Lower & completeData > Upper, NA)

Is there any way to filter out the outliers from my data without having to manually select all the columns by name?

Nat23
  • That is the first row of the quartiles data, so I've used the Quartile file and IQR values, to calculate the lower and upper values. So now, each of those files is composed of a row with 4 values (one for each variable/column) – Nat23 Sep 30 '22 at 14:14

1 Answer

Here's one method:

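# Replace values outside [Q1 - fac * IQR, Q3 + fac * IQR] with NA.
# Values that are already NA stay NA.
fun <- function(z, fac = 1.5, na.rm = TRUE) {
  Q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)
  R <- IQR(z, na.rm = na.rm)
  z[z < Q[1] - fac * R | z > Q[2] + fac * R] <- NA
  z
}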
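
As a quick check of what the helper does to a single vector, here is a minimal sketch. The vector `v` is made up for illustration: it has one obvious outlier and one value that is already missing. The helper is repeated only so the snippet runs on its own.

```r
# Same helper as above, repeated so this snippet is self-contained
fun <- function(z, fac = 1.5, na.rm = TRUE) {
  Q <- quantile(z, c(0.25, 0.75), na.rm = na.rm)
  R <- IQR(z, na.rm = na.rm)
  z[z < Q[1] - fac * R | z > Q[2] + fac * R] <- NA
  z
}

# Hypothetical values: 100 is far outside the fences, the last value is missing
v <- c(1, 2, 3, 4, 100, NA)
fun(v)
# [1]  1  2  3  4 NA NA
```

The outlier becomes NA and the existing NA is left alone, so no rows are dropped and the vector keeps its length.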

Sample data:

set.seed(42)
quux <- data.frame(ltr = letters[1:10], num1 = c(99, runif(9)), num2 = c(runif(9), 99))
quux
#    ltr       num1       num2
# 1    a 99.0000000  0.7050648
# 2    b  0.9148060  0.4577418
# 3    c  0.9370754  0.7191123
# 4    d  0.2861395  0.9346722
# 5    e  0.8304476  0.2554288
# 6    f  0.6417455  0.4622928
# 7    g  0.5190959  0.9400145
# 8    h  0.7365883  0.9782264
# 9    i  0.1346666  0.1174874
# 10   j  0.6569923 99.0000000

dplyr

library(dplyr)
quux %>%
  mutate(across(where(is.numeric), fun))
#    ltr      num1      num2
# 1    a        NA 0.7050648
# 2    b 0.9148060 0.4577418
# 3    c 0.9370754 0.7191123
# 4    d 0.2861395 0.9346722
# 5    e 0.8304476 0.2554288
# 6    f 0.6417455 0.4622928
# 7    g 0.5190959 0.9400145
# 8    h 0.7365883 0.9782264
# 9    i 0.1346666 0.1174874
# 10   j 0.6569923        NA

base R

isnum <- sapply(quux, is.numeric)
quux[isnum] <- lapply(quux[isnum], fun)
quux
#    ltr      num1      num2
# 1    a        NA 0.7050648
# 2    b 0.9148060 0.4577418
# 3    c 0.9370754 0.7191123
# 4    d 0.2861395 0.9346722
# 5    e 0.8304476 0.2554288
# 6    f 0.6417455 0.4622928
# 7    g 0.5190959 0.9400145
# 8    h 0.7365883 0.9782264
# 9    i 0.1346666 0.1174874
# 10   j 0.6569923        NA
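
On why the `replace()` call in the question changes nothing: the condition combines the two tests with `&`, and no value can be below the lower fence and above the upper fence at the same time, so the mask is always FALSE. In addition, as far as I can tell, comparing a data frame against a plain vector recycles the vector down the rows instead of pairing one bound with each column, so even with `|` the bounds would not line up. The sketch below keeps the question's precomputed bounds (renamed `lower` and `upper` here, with a new name `cleaned` for the result) and pairs them with columns using `Map()`; it is one possible fix, not the only one.

```r
# Numeric columns of the built-in iris data, as in the question
x <- iris[-5]

# Per-column quartiles, IQR, and fences
q     <- sapply(x, quantile, probs = c(0.25, 0.75), na.rm = TRUE)
iqr   <- sapply(x, IQR, na.rm = TRUE)
lower <- q[1, ] - 1.5 * iqr
upper <- q[2, ] + 1.5 * iqr

# The question's condition: a value cannot satisfy both tests at once
any(x < lower & x > upper)
# [1] FALSE

# Pair each column with its own bounds and use `|` instead of `&`
cleaned <- x
cleaned[] <- Map(
  function(col, lo, hi) replace(col, col < lo | col > hi, NA),
  x, lower, upper
)

# Number of values replaced in each column
colSums(is.na(cleaned))
```

For the data in the question, the same pattern should apply with `x <- df2[, 4:434]`, which selects the columns by position instead of by name.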
r2evans