
I implemented the default GMM model provided in MLlib for my algorithm. I repeatedly find that the resulting weights are always equal, no matter how many clusters I initialize. Is there a specific reason why the weights are not being adjusted? Am I implementing it wrong?

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrameNaFunctions

var colnames = df.columns
// Drop string and long columns so only numeric feature columns remain
for (x <- colnames) {
  val dtype = df.select(x).dtypes(0)._2
  if (dtype.equals("StringType") || dtype.equals("LongType")) {
    df = df.drop(x)
  }
}
colnames = df.columns
// Assemble the remaining numeric columns into a single feature vector
var assembler = new VectorAssembler().setInputCols(colnames).setOutputCol("features")
var output = assembler.transform(df)
// Normalizer with p = 2.0 rescales each row to unit L2 norm
var normalizer = new Normalizer().setInputCol("features").setOutputCol("normalizedfeatures").setP(2.0)
var normalizedOutput = normalizer.transform(output)
var temp = normalizedOutput.select("normalizedfeatures")
var outputs = temp.rdd.map(_.getAs[org.apache.spark.mllib.linalg.Vector]("normalizedfeatures"))
var gmm = new GaussianMixture().setK(2).setMaxIterations(10000).setSeed(25).run(outputs)

Output code:

for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}

As a result, all points are predicted into the same cluster:

var ol = gmm.predict(outputs).toDF
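One thing worth ruling out: Normalizer with setP(2.0) rescales each *row* to unit L2 norm (it does not standardize each column), so samples that differ mainly in magnitude collapse onto the same point of the unit sphere before GaussianMixture ever sees them. A minimal pure-Scala sketch of that effect (illustrative code, not Spark API):

```scala
// Sketch of what row-wise L2 normalization does: two points that differ
// only in scale become identical, so any cluster structure that lives in
// the magnitudes is erased before the mixture model is fit.
object NormalizerEffect {
  def l2normalize(v: Array[Double]): Array[Double] = {
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }

  def main(args: Array[String]): Unit = {
    val a = Array(1.0, 2.0, 2.0)     // L2 norm 3
    val b = Array(10.0, 20.0, 20.0)  // L2 norm 30, same direction as a
    println(l2normalize(a).mkString(", "))
    println(l2normalize(b).mkString(", "))
    // Both lines print the same unit vector: the two points coincide.
  }
}
```

If the features need rescaling at all, standardizing each column (e.g. with Spark's StandardScaler) preserves the between-sample structure that row normalization destroys.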

zero323
Leothorn

1 Answer

I am also having this issue. The weights and the Gaussians are always the same, and it seems independent of K.

My code is pretty simple. My data is 39-dimensional vectors of doubles. I just train like this...

val gmm = new GaussianMixture().setK(2).run(vectors)
for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}

I tried KMeans, and it worked as expected. So I thought this has to be a bug with GaussianMixture.

But then I tried clustering just the first dimension, and it worked. Now I think it must be an EM issue with too little data... except I have lots.

Any GMM experts out there? How much data does one need for GaussianMixture with 39 dimensions?

Or is this a bug after all?
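For what it's worth, EM itself does move the mixture weights away from uniform on well-separated data. A minimal hand-rolled 1-D, two-component EM makes that easy to check (an illustrative sketch, not Spark's implementation):

```scala
// Minimal 1-D, two-component EM for a Gaussian mixture.
// On well-separated data with a 30/10 split, the weights move to roughly
// 0.75/0.25, so weights stuck at uniform usually point at the data or the
// preprocessing rather than at EM itself.
object TinyGmm {
  case class Comp(w: Double, mu: Double, sigma2: Double)

  def pdf(x: Double, mu: Double, s2: Double): Double =
    math.exp(-(x - mu) * (x - mu) / (2 * s2)) / math.sqrt(2 * math.Pi * s2)

  def em(data: Array[Double], init: (Comp, Comp), iters: Int): (Comp, Comp) = {
    var (c1, c2) = init
    for (_ <- 0 until iters) {
      // E-step: responsibility of component 1 for each point
      val r = data.map { x =>
        val p1 = c1.w * pdf(x, c1.mu, c1.sigma2)
        val p2 = c2.w * pdf(x, c2.mu, c2.sigma2)
        p1 / (p1 + p2)
      }
      // M-step: re-estimate weights, means, and variances
      val n1 = r.sum
      val n2 = data.length - n1
      val mu1 = data.zip(r).map { case (x, ri) => ri * x }.sum / n1
      val mu2 = data.zip(r).map { case (x, ri) => (1 - ri) * x }.sum / n2
      val v1 = data.zip(r).map { case (x, ri) => ri * (x - mu1) * (x - mu1) }.sum / n1
      val v2 = data.zip(r).map { case (x, ri) => (1 - ri) * (x - mu2) * (x - mu2) }.sum / n2
      c1 = Comp(n1 / data.length, mu1, math.max(v1, 1e-6))
      c2 = Comp(n2 / data.length, mu2, math.max(v2, 1e-6))
    }
    (c1, c2)
  }

  def main(args: Array[String]): Unit = {
    // 30 points spread over [0, 3), 10 points spread over [10, 11)
    val data = Array.tabulate(30)(i => i * 0.1) ++ Array.tabulate(10)(i => 10.0 + i * 0.1)
    val (c1, c2) = em(data, (Comp(0.5, 1.0, 1.0), Comp(0.5, 10.0, 1.0)), 50)
    println(f"w1=${c1.w}%.3f w2=${c2.w}%.3f")  // weights end up near 0.75 / 0.25
  }
}
```

If a toy run like this separates the weights but the real pipeline does not, that suggests looking at the preprocessing (e.g. row normalization) or at degenerate dimensions rather than at the EM solver.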

opus111