I have a dataframe with 30 columns and >10,000 rows.
How can I run an outlier analysis on a set of variables that returns TRUE if ANY of the variables falls outside its own threshold (3 SDs), and FALSE if none of them do, with the TRUE/FALSE values stored in a new column?
I have used quantile() to find cut-off values for each variable, taking the 0.3% and 99.7% quantiles as rough stand-ins for 3 standard deviations, e.g.:
quantile(df$a, 0.003, na.rm = TRUE)  # and
quantile(df$a, 0.997, na.rm = TRUE)
Say the lower value is 2.5 and the upper value is 10.5 for this variable. I have then created a new variable:
df$outliers <- df$a < 2.5 | df$a > 10.5
which gives TRUE values when values in column a are less than 2.5 or greater than 10.5.
What I would like is for df$outliers to represent the outlier status for a set of columns, not just one, e.g. columns d, e, f, g, l, m etc., each of which has its own threshold values.
What is the best way to do this?
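To make the goal concrete, here is a minimal sketch of the kind of result I am after, not necessarily the best approach. The toy data, the `vars` vector and the `flags` matrix are made up for illustration. It assumes the same 0.3% / 99.7% quantile cut-offs as above and treats missing values as "not an outlier".

```r
# Toy data standing in for the real dataframe (values are illustrative only)
set.seed(1)
df <- data.frame(a = rnorm(1000), d = rnorm(1000, 5, 2), e = runif(1000))

# Columns to screen; the real set would be d, e, f, g, l, m, ...
vars <- c("d", "e")

# For each column, flag values outside that column's own quantile cut-offs
flags <- sapply(df[vars], function(x) {
  lims <- quantile(x, c(0.003, 0.997), na.rm = TRUE)
  x < lims[1] | x > lims[2]
})

# TRUE if any screened column is out of range in that row;
# na.rm = TRUE means an NA in a column does not count as an outlier
df$outliers <- rowSums(flags, na.rm = TRUE) > 0
```

Here `flags` is a logical matrix with one column per screened variable, so `table(df$outliers)` gives the number of flagged rows and `colSums(flags, na.rm = TRUE)` shows how many flags each variable contributes.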