I would like to cluster using GMM the classical iris dataset. I got the dataset from:
https://gist.github.com/netj/8836201
and my program so far is the following:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture as mix
from sklearn.cross_validation import StratifiedKFold
def main():
data=pd.read_csv("iris.csv",header=None)
data=data.iloc[1:]
data[4]=data[4].astype("category")
data[4]=data[4].cat.codes
target=np.array(data.pop(4))
X=np.array(data).astype(float)
kf=StratifiedKFold(target,n_folds=10,shuffle=True,random_state=1234)
train_ind,test_ind=next(iter(kf))
X_train=X[train_ind]
y_train=target[train_ind]
gmm_calc(X_train,"full",y_train)
def gmm_calc(X_train,cov,y_train):
print X_train
print y_train
n_classes = len(np.unique(y_train))
model=mix(n_components=n_classes,covariance_type="full")
model.means_ = np.array([X_train[y_train == i].mean(axis=0) for i in
xrange(n_classes)])
model.fit(X_train)
y_predict=model.predict(X_train)
print cov," ",y_train
print cov," ",y_predict
print (np.mean(y_predict==y_train))*100
The problem I got is when I try to get the number of coincidences y_predict=y_train, because every time I run the program I get different results. For example:
First run:
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 0 2 2 2 2 2 2 2 2 2
2 2 2 0 2 2 2 2 2 2 2 2 2 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
0.0
Second run:
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0
0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
33.33333333333333
Third run:
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
full [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
98.51851851851852
So, as you can see the results differ with every run. I found some code over the Internet that is in:
https://scikit-learn.org/0.16/auto_examples/mixture/plot_gmm_classifier.html
But they got with the full co-variance an accuracy of approximately 82% for the train set. What am I doing wrong in this case?
Thanks
Update: I found that on the internet example it was used GMM instead of the new GaussianMixture. I also found that in the example the GMM parameters were initialized in a supervised way with: classifier.means_ = np.array([X_train[y_train == i].mean(axis=0) for i in xrange(n_classes)])
I have put the modified code above, but still it changes the results every time that I run it, but with the library GMM it does not happen.