I am using H2O's k-means in R to segment my population. The method needs to be audited, so I would like to be able to explain the threshold that H2O's k-means uses.
The H2O k-means documentation (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/k-means.html) says:
H2O uses proportional reduction in error (PRE) to determine when to stop splitting. The PRE value is calculated based on the sum of squares within (SSW).
PRE=(SSW[before split]−SSW[after split])/SSW[before split]
H2O stops splitting when PRE falls below a threshold, which is a function of the number of variables and the number of cases as described below:
threshold takes the smaller of these two values:
either 0.8 or [0.02 + 10/number_of_training_rows + 2.5/(number_of_model_features)^2]
The source code (https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/kmeans/KMeans.java) computes it as:
final double rel_improvement_cutoff = Math.min(0.02 + 10. / _train.numRows() + 2.5 / Math.pow(model._output.nfeatures(), 2), 0.8);
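To make the numbers concrete for the audit, here is a small standalone sketch of how I understand the two formulas. It is not H2O code: the class name `PreThreshold`, the methods `pre` and `cutoff`, and the example values (1000 rows, 5 features, SSW going from 100 to 80) are my own assumptions for illustration; only the cutoff expression is copied from the line above.

```java
// Illustrative sketch only, not part of H2O. Names and inputs are hypothetical.
class PreThreshold {

    // Proportional reduction in error, as defined in the quoted documentation:
    // PRE = (SSW[before split] - SSW[after split]) / SSW[before split]
    static double pre(double sswBefore, double sswAfter) {
        return (sswBefore - sswAfter) / sswBefore;
    }

    // Same expression as rel_improvement_cutoff in KMeans.java,
    // with the row and feature counts passed in as plain arguments.
    static double cutoff(long numRows, int numFeatures) {
        return Math.min(0.02 + 10.0 / numRows + 2.5 / Math.pow(numFeatures, 2), 0.8);
    }

    public static void main(String[] args) {
        long rows = 1000;   // hypothetical number of training rows
        int features = 5;   // hypothetical number of model features

        double threshold = cutoff(rows, features);
        double improvement = pre(100.0, 80.0); // hypothetical SSW before/after a split

        System.out.printf("threshold = %.4f%n", threshold);
        System.out.printf("PRE       = %.4f%n", improvement);
        // Per the documentation, splitting stops once PRE falls below the threshold.
        System.out.println("stop splitting: " + (improvement < threshold));
    }
}
```

With these example inputs the threshold works out to roughly 0.02 + 0.01 + 0.1 = 0.13, so a split that reduces SSW by 20% would still be accepted. The 0.8 cap only takes effect for very small data sets with very few features, for example 10 rows and 1 feature.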
Where does this threshold come from? Are there any scientific papers that justify it?