I'm trying to cluster my data using Gaussian Mixture Model in sparklyr:

ml_gaussian_mixture(formula= ~ var1 + var2 + var3 + var4 + var5, k = 5)

However, unlike ml_kmeans(), which returns WSSSE, this function doesn't return a metric for evaluating the number of clusters. Is there a way to get the Silhouette score or BIC for ml_gaussian_mixture() in sparklyr?

Egodym

1 Answer

With

gmm_model <- ml_gaussian_mixture(iris_tbl, Species ~ .)

you can get the log-likelihood as

gmm_model$summary$log_likelihood

You can then use this value to compute BIC or AIC.

I'm sure there must be a way to get it directly, but if not, you can calculate BIC as

log(n) * (k - 1 + k * p + k * p * (p + 1) / 2) - 2 * gmm_model$summary$log_likelihood

where n is the number of samples, k is the number of clusters, and p is the number of variables. Here k - 1 + k * p + k * p * (p + 1) / 2 is the number of free parameters in a Gaussian mixture model with unrestricted covariance matrices: k - 1 mixing weights, k * p means, and k * p * (p + 1) / 2 covariance entries (each covariance matrix is symmetric).


Example:

library(sparklyr)
sc <- spark_connect(master = "local")
iris_tbl <- sdf_copy_to(sc, iris, name = "iris_tbl", overwrite = TRUE)
gmm_model <- ml_gaussian_mixture(iris_tbl, Species ~ .)

gmm_model$summary$log_likelihood
#[1] -294.1398
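
To turn that log-likelihood into a BIC value, a small helper along these lines may be enough. This is only a sketch: gmm_n_params() and gmm_bic() are hypothetical names, not sparklyr functions. It assumes full (unrestricted) covariance matrices, that the reported log-likelihood is the total over all rows, and that the model above used k = 2 because no k was passed.

```r
# Number of free parameters in a k-component GMM with full covariance
# matrices over p variables (hypothetical helper, not part of sparklyr).
gmm_n_params <- function(k, p) {
  (k - 1) +               # mixing weights (they sum to 1)
    k * p +               # component means
    k * p * (p + 1) / 2   # symmetric covariance matrices
}

# BIC = log(n) * (number of free parameters) - 2 * log-likelihood.
# Lower values indicate a better trade-off between fit and complexity.
gmm_bic <- function(log_lik, n, k, p) {
  log(n) * gmm_n_params(k, p) - 2 * log_lik
}

# Values from the iris example: 150 rows, 4 numeric features.
# k = 2 is an assumption (no k was given when fitting the model).
bic <- gmm_bic(log_lik = -294.1398, n = 150, k = 2, p = 4)
```

To choose the number of clusters, one option is to fit the model for several values of k, compute the BIC for each fit in this way, and prefer the k with the lowest value.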
kangaroo_cliff