
I implemented the default GMM model provided in MLlib for my algorithm. I repeatedly find that the resulting weights are always equal, no matter how many clusters I initialize. Is there a specific reason why the weights are not being adjusted? Am I implementing it wrong?

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.Normalizer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrameNaFunctions

var colnames = df.columns
// Drop string and long columns so only numeric feature columns remain
for (x <- colnames) {
  val dtype = df.select(x).dtypes(0)._2
  if (dtype.equals("StringType") || dtype.equals("LongType")) {
    df = df.drop(x)
  }
}
colnames = df.columns
// Assemble the remaining numeric columns into a single feature vector
var assembler = new VectorAssembler().setInputCols(colnames).setOutputCol("features")
var output = assembler.transform(df)
// Normalizer with p = 2.0 rescales each row to unit L2 norm
var normalizer = new Normalizer().setInputCol("features").setOutputCol("normalizedfeatures").setP(2.0)
var normalizedOutput = normalizer.transform(output)
var temp = normalizedOutput.select("normalizedfeatures")
var outputs = temp.rdd.map(_.getAs[org.apache.spark.mllib.linalg.Vector]("normalizedfeatures"))
var gmm = new GaussianMixture().setK(2).setMaxIterations(10000).setSeed(25).run(outputs)

Output code:

for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}

As a result, all points are predicted into the same cluster:

var ol = gmm.predict(outputs).toDF
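One thing worth ruling out: Normalizer with setP(2.0) rescales each *row* to unit L2 norm (it does not standardize each column), so samples that differ mainly in magnitude collapse onto the same point of the unit sphere before GaussianMixture ever sees them. A minimal pure-Scala sketch of that effect (illustrative code, not Spark API):

```scala
// Sketch of what row-wise L2 normalization does: two points that differ
// only in scale become identical, so any cluster structure that lives in
// the magnitudes is erased before the mixture model is fit.
object NormalizerEffect {
  def l2normalize(v: Array[Double]): Array[Double] = {
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }

  def main(args: Array[String]): Unit = {
    val a = Array(1.0, 2.0, 2.0)     // L2 norm 3
    val b = Array(10.0, 20.0, 20.0)  // L2 norm 30, same direction as a
    println(l2normalize(a).mkString(", "))
    println(l2normalize(b).mkString(", "))
    // Both lines print the same unit vector: the two points coincide.
  }
}
```

If the features need rescaling at all, standardizing each column (e.g. with Spark's StandardScaler) preserves the between-sample structure that row normalization destroys.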

zero323
Leothorn

1 Answer

I am also having this issue. The weights and the Gaussians are always the same, and it seems independent of K.

My code is pretty simple. My data is 39-dimensional vectors of doubles. I just train like this...

val gmm = new GaussianMixture().setK(2).run(vectors)
for (i <- 0 until gmm.k) {
  println("weight=%f\nmu=%s\nsigma=\n%s\n" format
    (gmm.weights(i), gmm.gaussians(i).mu, gmm.gaussians(i).sigma))
}

I tried KMeans, and it worked as expected. So I thought this has to be a bug with GaussianMixture.

But then I tried clustering just the first dimension, and it worked. Now I think it must be an EM issue with too little data... except I have lots.

Any GMM experts out there? How much data does one need for GaussianMixture with 39 dimensions?

Or is this a bug after all?
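For what it's worth, EM itself does move the mixture weights away from uniform on well-separated data. A minimal hand-rolled 1-D, two-component EM makes that easy to check (an illustrative sketch, not Spark's implementation):

```scala
// Minimal 1-D, two-component EM for a Gaussian mixture.
// On well-separated data with a 30/10 split, the weights move to roughly
// 0.75/0.25, so weights stuck at uniform usually point at the data or the
// preprocessing rather than at EM itself.
object TinyGmm {
  case class Comp(w: Double, mu: Double, sigma2: Double)

  def pdf(x: Double, mu: Double, s2: Double): Double =
    math.exp(-(x - mu) * (x - mu) / (2 * s2)) / math.sqrt(2 * math.Pi * s2)

  def em(data: Array[Double], init: (Comp, Comp), iters: Int): (Comp, Comp) = {
    var (c1, c2) = init
    for (_ <- 0 until iters) {
      // E-step: responsibility of component 1 for each point
      val r = data.map { x =>
        val p1 = c1.w * pdf(x, c1.mu, c1.sigma2)
        val p2 = c2.w * pdf(x, c2.mu, c2.sigma2)
        p1 / (p1 + p2)
      }
      // M-step: re-estimate weights, means, and variances
      val n1 = r.sum
      val n2 = data.length - n1
      val mu1 = data.zip(r).map { case (x, ri) => ri * x }.sum / n1
      val mu2 = data.zip(r).map { case (x, ri) => (1 - ri) * x }.sum / n2
      val v1 = data.zip(r).map { case (x, ri) => ri * (x - mu1) * (x - mu1) }.sum / n1
      val v2 = data.zip(r).map { case (x, ri) => (1 - ri) * (x - mu2) * (x - mu2) }.sum / n2
      c1 = Comp(n1 / data.length, mu1, math.max(v1, 1e-6))
      c2 = Comp(n2 / data.length, mu2, math.max(v2, 1e-6))
    }
    (c1, c2)
  }

  def main(args: Array[String]): Unit = {
    // 30 points spread over [0, 3), 10 points spread over [10, 11)
    val data = Array.tabulate(30)(i => i * 0.1) ++ Array.tabulate(10)(i => 10.0 + i * 0.1)
    val (c1, c2) = em(data, (Comp(0.5, 1.0, 1.0), Comp(0.5, 10.0, 1.0)), 50)
    println(f"w1=${c1.w}%.3f w2=${c2.w}%.3f")  // weights end up near 0.75 / 0.25
  }
}
```

If a toy run like this separates the weights but the real pipeline does not, that suggests looking at the preprocessing (e.g. row normalization) or at degenerate dimensions rather than at the EM solver.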

opus111