Apparently, the means_ attribute returns different values from the means I calculated for each cluster (or I have misunderstood what it returns).

The following is the code I wrote to check how well a GMM fits my time series data.

import numpy as np
import pandas as pd
import seaborn as sns
import time
import matplotlib as mpl
import matplotlib.pyplot as plt

from sklearn.mixture import BayesianGaussianMixture
from sklearn.mixture import GaussianMixture


toc = time.time()

The input file contains (# of meters/samples) x (# of features) values.

read = pd.read_csv('input', sep='\t', index_col=0, header=0,
                   names=['meter', '6:30', '9:00', '15:30', '22:30', 'std_year', 'week_score', 'season_score'],
                   encoding='utf-8')
read.drop('meter', axis=1, inplace=True)
read['std_year'] = read['std_year'].divide(4).round(2)

# as_matrix() was removed from pandas; select the columns and call to_numpy() instead
input = read[['6:30', '9:00', '15:30', '22:30']].to_numpy()

I fit a GMM with 10 components. (On the BIC plot, 5 components gave the lowest score, but at about -7,000. After a discussion with my advisor, that isn't impossible, but it still seems odd.)

gmm = GaussianMixture(n_components=10, covariance_type ='full', \
                  init_params = 'random', max_iter = 100, random_state=0)
gmm.fit(input)
print(gmm.means_.round(2))
cluster = gmm.predict(input)
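
For reference, the BIC comparison mentioned above can be sketched roughly as follows. This is only an illustration on synthetic data: `X`, `candidates`, `bics`, and `best_k` are names made up for the example, and the two tight blobs are an assumption standing in for the real meter readings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the real data (an assumption, not the original input):
# two tight blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(200, 2)),
    rng.normal(loc=1.0, scale=0.1, size=(200, 2)),
])

candidates = range(1, 7)
bics = []
for k in candidates:
    model = GaussianMixture(n_components=k, covariance_type='full', random_state=0)
    model.fit(X)
    bics.append(model.bic(X))  # lower BIC is better

best_k = list(candidates)[int(np.argmin(bics))]
print(best_k, np.round(bics, 1))
```

A negative BIC is not an error in itself: BIC is -2 ln L + k ln n, and for tightly concentrated continuous data the density can exceed 1, so the log-likelihood ln L can be positive.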

Next, I manually calculate the centroid/center of each cluster (if those are the right terms for the mean vectors), using the labels returned by .predict.

Specifically, cluster contains a value from 0 to 9 for each sample, indicating its cluster. I reshape it into a column and concatenate it with the input matrix of (# of samples) x (# of attributes). To take advantage of how easily pandas handles data of this size, I then turn the result into a DataFrame.

cluster = np.array(cluster).reshape(-1,1) #(3488, 1)
ret = np.concatenate((cluster, input), axis=1) #(3488, 5)
ret_pd = pd.DataFrame(ret, columns=['label','6:30', '9:00', '15:30', '22:30'])
ret_pd['label'] = ret_pd['label'].astype(int)

Each meter's cluster assignment is stored in the 'label' column. The following code selects the rows for each label and takes the column-wise mean.

cluster_mean = []
for label in range(10):
    # take the column-wise mean of each cluster
    segment= ret_pd[ret_pd['label']== label]
    print(segment)
    turn = np.array(segment)[:, 1:]
    print(turn.shape)
    mean_ = np.mean(turn, axis=0).round(2)  # ndarray of column means
    print(mean_)
    plt.plot(np.array(mean_), label='cluster %s' %label) 

    cluster_mean.append(list(mean_))

print(cluster_mean)

xvalue = ['6:30', '9:00', '15:30', '22:30']
plt.ylabel('Energy Use [kWh]')
plt.xlabel('time of day')
plt.xticks(range(4), xvalue)
plt.legend(loc = 'upper center', bbox_to_anchor = (0.5, 1.05),\
       ncol =2, fancybox =True, shadow= True)
plt.savefig('cluster_gmm_100.png')

tic = time.time()
print('time ', tic-toc)
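
The per-label column means computed by the loop above can also be obtained in one step with pandas groupby. A minimal sketch, assuming a small made-up frame in place of the real ret_pd (the random values and the three labels are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for ret_pd: 12 rows, 3 labels, 4 feature columns.
rng = np.random.default_rng(0)
cols = ['6:30', '9:00', '15:30', '22:30']
ret_pd = pd.DataFrame(rng.random((12, 4)), columns=cols)
ret_pd.insert(0, 'label', np.repeat([0, 1, 2], 4))

# One row per label, one column per feature.
cluster_mean = ret_pd.groupby('label')[cols].mean().round(2)
print(cluster_mean)
```

This gives the same hard-assignment means as the loop, indexed by label, without converting each segment to an array.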

Interestingly, the .means_ attribute from the library returns different values from the ones I calculate in this code.

Scikit-learn's .means_:

[[ 0.46  1.42  1.12  1.35]
 [ 0.49  0.78  1.19  1.49]
 [ 0.49  0.82  1.01  1.63]
 [ 0.6   0.77  0.99  1.55]
 [ 0.78  0.75  0.92  1.42]
 [ 0.58  0.68  1.03  1.57]
 [ 0.4   0.96  1.25  1.47]
 [ 0.69  0.83  0.98  1.43]
 [ 0.55  0.96  1.03  1.5 ]
 [ 0.58  1.01  1.01  1.47]]

My results:

[[0.45000000000000001, 1.6599999999999999, 1.1100000000000001, 1.29],    
 [0.46000000000000002, 0.73999999999999999, 1.26, 1.48], 
[0.45000000000000001, 0.80000000000000004, 0.92000000000000004, 1.78], 
[0.68000000000000005, 0.72999999999999998, 0.85999999999999999, 1.5900000000000001], 
[0.91000000000000003, 0.68000000000000005, 0.84999999999999998, 1.3600000000000001], 
[0.58999999999999997, 0.65000000000000002, 1.02, 1.5900000000000001], 
[0.35999999999999999, 1.03, 1.28, 1.46], 
[0.77000000000000002, 0.88, 0.94999999999999996, 1.3500000000000001], 
[0.53000000000000003, 1.0700000000000001, 0.97999999999999998, 1.53], 
[0.66000000000000003, 1.21, 0.95999999999999996, 1.3600000000000001]]

As an aside, I'm not sure why my results are not displayed rounded to 2 decimal places.
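
One likely explanation, offered as an assumption about the environment rather than something verified: the values are rounded, but 0.45 has no exact binary representation, and older NumPy versions printed float64 scalars with 17 significant digits. The sketch below reproduces that display with plain string formatting and shows that converting to Python floats with tolist() prints the short form.

```python
import numpy as np

mean_ = np.array([0.45, 1.66, 1.11, 1.29]).round(2)

# 17 significant digits expose the nearest binary double to 0.45.
long_form = format(float(mean_[0]), '.17g')
print(long_form)

# tolist() yields Python floats, whose repr is the shortest round-tripping form.
short_list = mean_.tolist()
print(short_list)
```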

dia

1 Answer

Though I'm not completely sure what your code is doing, I'm fairly sure what the problem is here.

The parameters returned by means_ are the means of the parametric (Gaussian) distributions that make up the model. When you calculate the means yourself, you take the average of all data points assigned to each component by predict, which will almost always give different (though similar) results. The reason is that EM fits each component mean as an average over all samples, weighted by each sample's posterior probability of belonging to that component, whereas predict assigns each sample entirely to its most probable component. To get a better understanding of why these differ, I would suggest reading a bit more about the expectation-maximization algorithm that scikit-learn uses to fit GMMs.
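
As a rough illustration on made-up data (two overlapping synthetic blobs, not your meter readings), the sketch below compares the hard-assignment means with the means weighted by predict_proba. The names `hard_means`, `soft_means`, and `resp` are just for this example, and the tight `tol` is an assumption chosen so that EM is run close to convergence.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping synthetic blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(300, 2)),
    rng.normal(loc=2.0, scale=1.0, size=(300, 2)),
])

gmm = GaussianMixture(n_components=2, covariance_type='full',
                      tol=1e-10, max_iter=5000, random_state=0).fit(X)

# Hard assignment: each sample counts fully toward its most probable component.
labels = gmm.predict(X)
hard_means = np.array([X[labels == k].mean(axis=0)
                       for k in range(gmm.n_components)])

# Soft assignment: each sample is weighted by its posterior probability.
resp = gmm.predict_proba(X)                      # shape (n_samples, n_components)
soft_means = (resp.T @ X) / resp.sum(axis=0)[:, None]

print('means_     \n', gmm.means_.round(2))
print('soft means \n', soft_means.round(2))
print('hard means \n', hard_means.round(2))
```

The responsibility-weighted means should track means_ closely once EM has converged, while the hard-assignment means drift away from them wherever the components overlap.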

piman314
  • this was the intuition I was receiving but wasn't hundred pct sure of. Thx for confirming. – dia Mar 15 '18 at 12:19
  • @piman hi, I was wondering if you could have a look at a similar problem in my [post](https://stackoverflow.com/questions/63414169/how-can-implement-em-gmm-in-python?) Thanks in advance – Mario Aug 17 '20 at 13:05