
Problem

I have a data frame consisting of more than 5 variables at any time, and I am trying to run K-Means on it. Because K-Means is greatly affected by outliers, I've spent a few hours looking for how to calculate and remove multivariate outliers. Most examples only demonstrate the 2-variable case.


Possible Solutions Explored

  • The mvoutlier package
  • Another Outlier Detection Method (a distance-from-cluster-center approach from another post)


Issues thus Far

Regarding mvoutlier: I was unable to generate a result because the package reported that my dataset contained negative values and could not proceed. I'm not sure how to shift my data to be strictly positive, since I need the negatives in the set I am working with.

Regarding Another Outlier Detection Method: I was able to come up with a list of outliers, but I am unsure how to exclude them from the current data set. I also know that these calculations are done after K-Means, so I will need to apply the math and remove the outliers before running my actual K-Means.


Minimal Verifiable Example

Unfortunately, the dataset I'm using is off-limits and cannot be shown to anyone, so all you'll need is any random data set with more than 3 variables. The code below is the code from the Another Outlier Detection Method post, adapted to work with my data. It should work dynamically with a random data set as well, provided the data set has enough rows that 5 cluster centers is reasonable.

clusterAmount <- 5
cluster <- kmeans(dataFrame, centers = clusterAmount, nstart = 20)

# coordinates of the assigned cluster center for each row
centers <- cluster$centers[cluster$cluster, ]

# Euclidean distance of each row from its assigned center
# (the squaring must happen inside rowSums)
distances <- sqrt(rowSums((dataFrame - centers)^2))

# each row's distance relative to the mean distance within its cluster
m <- tapply(distances, cluster$cluster, mean)
d <- distances/(m[cluster$cluster])

# flag the top 1% of relative distances as outliers
outliers <- d[order(d, decreasing = TRUE)][1:(nrow(dataFrame) * .01)]

Output: a list of outliers, ordered by their relative distance from the center of the cluster they belong to, I believe. The issue then is getting these results paired up with the respective rows in the data frame and removing them so I can start my K-Means procedure. (Note: while in this example I ran K-Means prior to removing outliers, once I have a solution I'll make sure to remove the outliers before running K-Means.)
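
For reference, here is a minimal sketch of one way to do that pairing. It assumes the data frame keeps its row names through the pipeline (rowSums() preserves them, so outliers above ends up as a named vector whose names are the flagged rows); the variable names here are just for illustration:

# names(outliers) are the row names of the flagged rows
outlierRows <- names(outliers)

# drop those rows, then run K-Means on the cleaned data
cleanedDataFrame <- dataFrame[!rownames(dataFrame) %in% outlierRows, ]
cluster <- kmeans(cleanedDataFrame, centers = clusterAmount, nstart = 20)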


Question

With the Another Outlier Detection Method example in place, how do I pair its results with the rows in my current data frame so that I can exclude those rows before doing K-Means?

Jon

1 Answer


I don't know if this is exactly helpful, but if your data is multivariate normal you may want to try a Wilks (1963) based method. Wilks showed that the scaled squared Mahalanobis distances of multivariate normal data follow a Beta distribution. We can take advantage of this (the iris Petal measurements are used as an example):

test.dat <- iris[, -c(1, 2, 5)] # keep Petal.Length and Petal.Width only

Wilks.function <- function(dat){
  n <- nrow(dat)
  p <- ncol(dat)
  # scaled squared Mahalanobis distances follow a Beta distribution
  u <- n * mahalanobis(dat, center = colMeans(dat), cov = cov(dat)) / (n - 1)^2
  w <- 1 - u
  F.stat <- ((n - p - 1) / p) * (1 / w - 1) # equivalent F statistic
  p.val <- 1 - round(pf(F.stat, p, n - p - 1), 3) # p value for each row
  cbind(w, F.stat, p = p.val)
}

# scatterplot of the Petal data, point shape by species
plot(test.dat, 
     col = "blue", 
     pch = c(15,16,17)[as.numeric(iris$Species)])

dat.rows <- Wilks.function(test.dat); head(dat.rows)
#                 w    F.stat     p
#[1,] 0.9888813 0.8264127 0.440
#[2,] 0.9907488 0.6863139 0.505
#[3,] 0.9869330 0.9731436 0.380
#[4,] 0.9847254 1.1400985 0.323
#[5,] 0.9843166 1.1710961 0.313
#[6,] 0.9740961 1.9545687 0.145

Then we can simply flag the rows whose p-values indicate a significant departure from the Beta distribution (here at the 0.05 level):

outliers <- which(dat.rows[,"p"] < 0.05)

# overlay the flagged rows in red on the existing plot
points(test.dat[outliers,], 
       col = "red", 
       pch = c(15,16,17)[as.numeric(iris$Species[outliers])])

outliers
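
To tie this back to the original question, here is a minimal sketch of dropping the flagged rows before clustering (the choice of 3 centers here is arbitrary for iris, not a recommendation):

# guard the empty case: test.dat[-integer(0), ] would select zero rows
clean.dat <- if (length(outliers) > 0) test.dat[-outliers, ] else test.dat
kmeans(clean.dat, centers = 3, nstart = 20)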

Evan Friedland
  • Nice answer. But how can we know if the 150 variables are multivariate normally distributed? Otherwise, is there a transformation to reshape the multivariate distribution into a normal? – Seymour Jan 11 '18 at 12:26
  • 1
    That is a great Cross Validated question. See some of the following: [Multivariate distribution](https://stats.stackexchange.com/questions/187460/multivariate-distribution) [How do you detect if a given dataset has multivariate normal distribution?](https://stats.stackexchange.com/questions/41348/how-do-you-detect-if-a-given-dataset-has-multivariate-normal-distribution) Also, if you are looking at a dataset of 150 variables, check multicollinearity. [assumptions to derive ols estimator](https://stats.stackexchange.com/questions/149110/assumptions-to-derive-ols-estimator/149111#149111) – Evan Friedland Jan 11 '18 at 12:56
  • Thank you. I am in the context of clustering, therefore the second link is out of scope, right? – Seymour Jan 11 '18 at 13:22
  • 1
    I've done a quick search since I'm not as knowledgeable on clustering methods but it appears that some say yes, some say nah - to whether or not to remove some correlated variables. I'll drop some here: [1](https://stats.stackexchange.com/questions/62253/do-i-need-to-drop-variables-that-are-correlated-collinear-before-running-kmeans) [2](https://stats.stackexchange.com/questions/50537/should-one-remove-highly-correlated-variables-before-doing-pca/50583#50583) -- I essentially googled "Is clustering affected by multicollinearity" - sorry I can't be of more help – Evan Friedland Jan 11 '18 at 13:30
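
As a follow-up on the normality question raised in the comments, one quick screen (a sketch assuming the mvnormtest package; its mshapiro.test() expects a matrix with variables in rows, hence the transpose) is:

# install.packages("mvnormtest")  # assumed available
library(mvnormtest)

# multivariate Shapiro-Wilk test; a small p-value suggests
# the data are not multivariate normal
mshapiro.test(t(as.matrix(test.dat)))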