I am using H2O's k-means in R to segment my population. The method needs to be audited, so I would like to explain the threshold used in H2O's k-means.

The documentation of H2O k-means (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html) says:

H2O uses proportional reduction in error (PRE) to determine when to stop splitting. The PRE value is calculated based on the sum of squares within (SSW).

PRE = (SSW[before split] − SSW[after split]) / SSW[before split]

H2O stops splitting when PRE falls below a threshold, which is a function of the number of variables and the number of cases as described below:

The threshold takes the smaller of these two values:

either 0.8 or [0.02 + 10/number_of_training_rows + 2.5/(number_of_model_features)^2]

The source code (https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/kmeans/KMeans.java) implements it as:

final double rel_improvement_cutoff = Math.min(0.02 + 10. / _train.numRows() + 2.5 / Math.pow(model._output.nfeatures(), 2), 0.8);
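To make the rule concrete, here is a small Python sketch (an illustration with made-up numbers, not H2O's actual code) that mirrors the Java line above and applies the PRE stopping test:

```python
def pre_cutoff(n_rows: int, n_features: int) -> float:
    """Relative-improvement cutoff: min(0.02 + 10/n + 2.5/p^2, 0.8),
    matching the rel_improvement_cutoff line in KMeans.java."""
    return min(0.02 + 10.0 / n_rows + 2.5 / n_features**2, 0.8)

def pre(ssw_before: float, ssw_after: float) -> float:
    """Proportional reduction in error (PRE) for one candidate split."""
    return (ssw_before - ssw_after) / ssw_before

# Illustrative values: 1000 training rows, 5 features.
# Cutoff = 0.02 + 10/1000 + 2.5/25 = 0.02 + 0.01 + 0.1 = 0.13
cutoff = pre_cutoff(1000, 5)

# A split that only reduces SSW from 100 to 90 gives PRE = 0.10,
# which falls below 0.13, so splitting would stop here.
stop_splitting = pre(100.0, 90.0) < cutoff
```

Note that for small, low-dimensional datasets (e.g. 10 rows, 1 feature) the formula exceeds 0.8, so the 0.8 cap binds.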

Where does this threshold come from? Are there scientific papers about it?

A.Lag
  • Including links for that specific documentation and sources will not hurt, but add a better understanding of the context – emecas Mar 26 '18 at 14:53

1 Answer

I am responsible for that threshold. I developed it by running numerous datasets -- artificial and real -- through the k-means algorithm. I began some years ago working with SSW improvement and testing it as a chi-square variable, as recommended by John Hartigan. This criterion failed in a number of instances, so I switched to PRE. The equation above is the result of fitting a nonlinear model to results on datasets with a known number of clusters. When I wrote the k-means program for Tableau, I used this same PRE criterion. After I left Tableau for H2O, they substituted the Calinski-Harabasz index for my PRE rule, producing similar results.

Leland Wilkinson, Chief Scientist, H2O.