
I have been using k-Means to cluster data into 2 classes. However, I would now like to take a different approach and use a Gaussian Mixture Model (GMM) to cluster the data into 2 classes. I have gone through the Scikit-Learn documentation and other SO questions, but I am unable to understand how I can use a GMM for 2-class clustering in my present context.

I am able to easily cluster the data into 2 classes using k-Means as follows:

import pandas as pd
from scipy import stats
from sklearn.cluster import KMeans
import numpy as np

df = pd.read_pickle('my_df.pkl')
clmns = df.columns

df = df.fillna(df.mean())   # impute missing values with the column means
df.isnull().any()           # sanity check: no NaNs should remain

df_tr_std = stats.zscore(df[clmns])

kmeans = KMeans(n_clusters=2, random_state=0, n_init=100, max_iter=500, n_jobs=-1).fit(df_tr_std)
# >>> kmeans
# KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=500,
#     n_clusters=2, n_init=100, n_jobs=-1, precompute_distances='auto',
#     random_state=0, tol=0.0001, verbose=0)
labels = kmeans.labels_

I would appreciate a one-liner/short code segment that I can use to fit a GMM model on my data (df_tr_std). I am sure that fitting the GMM must be a fairly simple process, but I am confused about how my current k-Means setup can be modified into a GMM one.

  • How exactly can we do it for your (unknown to us) `df_tr_std` data? And what exactly is wrong or missing from the [iris](https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html#gmm-covariances) and [ellipsoids](https://scikit-learn.org/stable/auto_examples/mixture/plot_gmm.html#gaussian-mixture-model-ellipsoids) examples in the documentation? – desertnaut Mar 14 '19 at 16:45
  • Thanks for the comment. I agree that the data is unknown to you, but for that very purpose I have included boilerplate code for fitting df_tr_std (the training data frame, which consists of a variety of features) into the k-Means model. The iris and other similar examples do this, but I can't figure out how the same can be applied in the present context. My main problem is that I can find things like `gmm = GMM(n_components=4).fit(X); labels = gmm.predict(X)`, which can be used in a similar context (reference: https://jakevdp.github.io/PythonDataScienceHandbook/05.12-gaussian-mixtures.html). – JChat Mar 14 '19 at 17:04
  • But it isn't evident whether that is the most appropriate way to fit a GMM on a pandas-based DataFrame, which I can of course cast to a numpy array. I would appreciate your kind help and suggestions in this regard. Please consider the k-Means code above as the current context, in which df_tr_std is the data frame with all training features (see the sketch just below these comments). – JChat Mar 14 '19 at 17:05
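For reference, here is a minimal sketch of how the snippet cited in these comments could map onto the question's setup, assuming the df_tr_std array from the code above (GMM is the older name for what current scikit-learn exposes as sklearn.mixture.GaussianMixture):

from sklearn.mixture import GaussianMixture

# Fit a 2-component Gaussian Mixture Model on the standardised features;
# GaussianMixture accepts a numpy array or a pandas DataFrame directly.
gmm = GaussianMixture(n_components=2, n_init=100, random_state=0).fit(df_tr_std)
labels = gmm.predict(df_tr_std)        # hard cluster assignments (0 or 1)
probs = gmm.predict_proba(df_tr_std)   # soft assignments, one column per component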

1 Answer


Consider the following:

f(x | μ, σ) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

This equation gives you the Gaussian density for a specific case x, given the group mean μ, variance σ² and standard deviation σ.

The Z-score tells you where to cut between the classes: assuming a probability of 0.5 at that point, it lets you properly separate the two classes. C is the centroid of a class and N the number of examples.

[Figure: two Gaussians]

[Figure: Gauss centroids]
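To make this concrete, here is a rough one-dimensional sketch of the cut-point idea: fit one Gaussian per k-Means class along a single feature and cut where the two densities are equal, i.e. where each class has probability 0.5 under equal priors. The choice of the first feature column and the use of scipy.stats.norm are assumptions made purely for illustration.

import numpy as np
from scipy.stats import norm

# x: a single standardised feature; labels: the two k-Means classes from the question
x = df_tr_std[:, 0]                              # illustration: first feature only
mu0, sd0 = x[labels == 0].mean(), x[labels == 0].std()
mu1, sd1 = x[labels == 1].mean(), x[labels == 1].std()

# Evaluate the two fitted Gaussian densities between the class centroids
grid = np.linspace(min(mu0, mu1), max(mu0, mu1), 1000)
p0 = norm.pdf(grid, loc=mu0, scale=sd0)
p1 = norm.pdf(grid, loc=mu1, scale=sd1)

# Cut where the densities are (approximately) equal -> probability 0.5 for each class
cut = grid[np.argmin(np.abs(p0 - p1))]
new_labels = (x > cut).astype(int)   # label orientation is arbitrary, as with k-Means

In practice, sklearn's GaussianMixture (as sketched under the comments above) does this in all feature dimensions at once and returns the soft probabilities directly.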

razimbres
  • Thanks for your answer. However, I am already aware of the equations of the Gaussian distribution and am looking for a way to implement this in code (as mentioned in the question) in the present context, so that I can cluster my data into 2 classes accordingly. – JChat Mar 15 '19 at 09:41