Filtering Variables within Cluster Analysis in R

Question

I am attempting to run a cluster analysis (PAM) on a financial dataset with a lot of noise.

There are well over 100 variables, many of which are highly collinear.

Running the clustering algorithm on all of the columns is close to meaningless given the amount of noise and collinearity. I do not want to use PCA, because I would end up with components rather than ranges of the existing variables for each cluster, and I plan to analyze those ranges further.

By assessing the clustering tendency (Hopkins statistic) of a defined group of, say, 10 variables, I can determine whether clustering is viable. My question is whether there is a way to loop the Hopkins statistic over every possible group of, say, 10 variables, so that I can run the clustering algorithm on the group with the best Hopkins statistic.

I may be way off base with this, but any advice is appreciated.

Don't rely on the Hopkins statistic. It's a simple test for uniformity, but not for multimodality. I.e., a single Gaussian will have a high "clustering tendency", but that likely will not be useful to you. — Has QUIT--Anony-Mousse, Aug 18 '18 at 07:18

score 0 · Answer 1 · answered Aug 16 '18 at 17:54

The ‘clustertend’ package provides the Hopkins statistic as a function; see https://cran.r-project.org/web/packages/clustertend/clustertend.pdf

Nar

648
4
8

Thank you for the reply. My question is about the selection of the variables themselves. For instance, if I have 100 observations of 50 or so variables, is there a way to test which combination of variables returns the greatest Hopkins statistic? Right now I am stuck manually testing combinations of variables to cluster the observations on. – R.Bro Aug 16 '18 at 22:57
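For a sense of scale, here is a minimal base-R sketch of what "every possible group of 10 variables" means. The data matrix `X` and the scoring function `score_subset` are hypothetical placeholders (the score used here is just total variance, chosen only so the example runs); any subset criterion could be swapped in.

```r
# How many distinct 10-variable subsets exist for 50 and for 100 variables?
n_subsets_50  <- choose(50, 10)   # roughly 1e10
n_subsets_100 <- choose(100, 10)  # roughly 1.7e13

# Exhaustive enumeration is only practical for a handful of variables.
# Toy data: 100 observations of 6 variables (hypothetical).
set.seed(1)
X <- matrix(rnorm(100 * 6), ncol = 6,
            dimnames = list(NULL, paste0("v", 1:6)))

# Hypothetical placeholder criterion: total variance of the chosen columns.
score_subset <- function(cols) sum(apply(X[, cols, drop = FALSE], 2, var))

# Enumerate all 3-variable subsets and score each one.
subsets <- combn(colnames(X), 3, simplify = FALSE)
scores  <- vapply(subsets, score_subset, numeric(1))
best    <- subsets[[which.max(scores)]]
```

With 6 variables taken 3 at a time there are only 20 subsets, so the loop is instant; with 50 or 100 variables taken 10 at a time, a brute-force search of this kind is not realistic.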

score 0 · Answer 2 · answered Aug 18 '18 at 07:21

Use a subspace clustering approach.

These algorithms attempt to identify the clusters and, at the same time, the variables that distinguish each cluster.

But even these algorithms benefit from a smaller number of variables. First identify highly correlated variables (duplicates) and useless variables (noise), and remove them.
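A minimal base-R sketch of such a pre-filter follows. It makes several assumptions: the toy data frame, the helper name `drop_correlated`, the 0.9 correlation cutoff, and the variance threshold are all illustrative. Near-zero variance is only a crude stand-in for "useless"; deciding what counts as noise usually needs domain knowledge.

```r
# Toy data (hypothetical): one near-duplicate column and one constant column.
set.seed(42)
n <- 200
base_var <- rnorm(n)
X <- data.frame(
  a     = base_var,
  a_dup = base_var + rnorm(n, sd = 0.01),  # near-duplicate of `a`
  b     = rnorm(n),
  const = rep(1, n)                        # carries no information
)

# Step 1: drop variables with (near) zero variance.
variances <- vapply(X, var, numeric(1))
X1 <- X[, variances > 1e-8, drop = FALSE]

# Step 2: greedily keep a variable only if its absolute correlation with
# every variable already kept is at or below the cutoff.
drop_correlated <- function(df, cutoff = 0.9) {
  cm <- abs(cor(df))
  keep <- character(0)
  for (v in colnames(df)) {
    if (all(cm[v, keep] <= cutoff)) keep <- c(keep, v)
  }
  df[, keep, drop = FALSE]
}

filtered <- drop_correlated(X1, cutoff = 0.9)
```

The greedy pass depends on column order: of two highly correlated variables, the one that appears first is kept. Reordering the columns so that the more interpretable variable comes first is one way to control which of the pair survives.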

Don't rely on the Hopkins statistic. It's a simple test for uniformity, but not for multimodality. I.e., a single Gaussian will have a high "clustering tendency", but that likely will not be useful to you. So the statistic will likely not help.
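To illustrate the point, here is a hand-rolled base-R sketch of one common form of the Hopkins statistic. It is not the ‘clustertend’ implementation; the function name `hopkins_stat`, the sample size `m`, and the use of the data's bounding box for the uniform reference points are assumptions. Under this convention values near 0.5 suggest uniform data and values near 1 suggest non-uniform data (some implementations report the complement instead).

```r
hopkins_stat <- function(X, m = ceiling(nrow(X) / 10)) {
  X <- as.matrix(X)
  n <- nrow(X)
  d <- ncol(X)

  # Euclidean distance from point p to its nearest row in `data`.
  nn_dist <- function(p, data) sqrt(min(rowSums(sweep(data, 2, p)^2)))

  # w: for m sampled real points, distance to the nearest *other* real point.
  idx <- sample(n, m)
  w <- vapply(idx, function(i) nn_dist(X[i, ], X[-i, , drop = FALSE]),
              numeric(1))

  # u: for m points drawn uniformly from the bounding box of the data,
  # distance to the nearest real point.
  lo <- apply(X, 2, min)
  hi <- apply(X, 2, max)
  U <- matrix(runif(m * d, min = rep(lo, each = m), max = rep(hi, each = m)),
              nrow = m)
  u <- apply(U, 1, nn_dist, data = X)

  sum(u^d) / (sum(u^d) + sum(w^d))
}

set.seed(1)
gauss <- matrix(rnorm(1000), ncol = 2)  # a single Gaussian blob, no clusters
unif  <- matrix(runif(1000), ncol = 2)  # uniform noise

h_gauss <- hopkins_stat(gauss)
h_unif  <- hopkins_stat(unif)
```

In a run like this, `h_gauss` should typically come out well above `h_unif` even though the Gaussian sample contains no separate clusters at all, which is why a high value alone says little about whether PAM will find meaningful groups.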
