So apparently the means_ attribute returns different results from the means I calculated for each cluster (or I have a wrong understanding of what it returns!).
Below is the code I wrote to check how a GMM fits the time-series data I have.
import numpy as np
import pandas as pd
import seaborn as sns
import time
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.mixture import BayesianGaussianMixture
from sklearn.mixture import GaussianMixture
toc = time.time()
The input file contains (# of meters/samples) x (# of features).
read = pd.read_csv('input', sep='\t', index_col=0, header=0,
                   names=['meter', '6:30', '9:00', '15:30', '22:30',
                          'std_year', 'week_score', 'season_score'],
                   encoding='utf-8')
read.drop(columns='meter', inplace=True)
read['std_year'] = read['std_year'].divide(4).round(2)
input = read[['6:30', '9:00', '15:30', '22:30']].to_numpy()
I fit this into a GMM with 10 clusters. (Using the BIC plot, 5 was the optimal number with the lowest score, but that score was around -7,000. After a discussion with my advisor, it isn't impossible, but it still seems weird. A rough sketch of the kind of BIC sweep I mean is included right after the fit below.)
gmm = GaussianMixture(n_components=10, covariance_type='full',
                      init_params='random', max_iter=100, random_state=0)
gmm.fit(input)
print(gmm.means_.round(2))
cluster = gmm.predict(input)
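For context, the BIC check mentioned above was done along these lines; this is only a rough sketch, not my exact script (the range of component counts and the variable name candidate are illustrative):
# illustrative only: score several candidate component counts by BIC
for n in range(2, 16):
    candidate = GaussianMixture(n_components=n, covariance_type='full',
                                init_params='random', max_iter=100,
                                random_state=0)
    candidate.fit(input)
    print(n, candidate.bic(input))  # lower BIC is better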
What I do next is manually calculate the centroid/center (if it is correct to use these terms for the mean vectors) of each cluster, using the labels returned from .predict.
To be specific, cluster contains a value from 0 to 9 for each sample, indicating its cluster. I reshape this into a column and concatenate it to the input matrix of (# of samples) x (# of attributes). To take advantage of how easily pandas handles data of this size, I turn the result into a DataFrame.
cluster = np.array(cluster).reshape(-1,1) #(3488, 1)
ret = np.concatenate((cluster, input), axis=1) #(3488, 5)
ret_pd = pd.DataFrame(ret, columns=['label','6:30', '9:00', '15:30', '22:30'])
ret_pd['label'] = ret_pd['label'].astype(int)
Each meter's cluster assignment is stored under the column 'label'. The following code selects the rows belonging to each label and then takes the mean of each column.
cluster_mean = []
for label in range(10):
    # take the mean of each column for this cluster
    segment = ret_pd[ret_pd['label'] == label]
    print(segment)
    turn = np.array(segment)[:, 1:]
    print(turn.shape)
    mean_ = np.mean(turn, axis=0).round(2)  # 1-D array of column means
    print(mean_)
    plt.plot(np.array(mean_), label='cluster %s' % label)
    cluster_mean.append(list(mean_))
print(cluster_mean)
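As a sanity check on the loop above, the same per-label column means can be computed more compactly with pandas groupby. This is just an equivalent rewrite (the name cluster_mean_df is mine), not the code that produced the numbers below:
# per-cluster column means, equivalent to the loop above
cluster_mean_df = ret_pd.groupby('label').mean().round(2)
print(cluster_mean_df)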
xvalue = ['6:30', '9:00', '15:30', '22:30']
plt.ylabel('Energy Use [kWh]')
plt.xlabel('time of day')
plt.xticks(range(4), xvalue)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.05),
           ncol=2, fancybox=True, shadow=True)
plt.savefig('cluster_gmm_100.png')
tic = time.time()
print('time ', tic-toc)
What is interesting is that the .means_ attribute from scikit-learn returns different values from what I calculate in this code.
Scikit-learn's .means_:
[[ 0.46 1.42 1.12 1.35]
[ 0.49 0.78 1.19 1.49]
[ 0.49 0.82 1.01 1.63]
[ 0.6 0.77 0.99 1.55]
[ 0.78 0.75 0.92 1.42]
[ 0.58 0.68 1.03 1.57]
[ 0.4 0.96 1.25 1.47]
[ 0.69 0.83 0.98 1.43]
[ 0.55 0.96 1.03 1.5 ]
[ 0.58 1.01 1.01 1.47]]
My results:
[[0.45000000000000001, 1.6599999999999999, 1.1100000000000001, 1.29],
[0.46000000000000002, 0.73999999999999999, 1.26, 1.48],
[0.45000000000000001, 0.80000000000000004, 0.92000000000000004, 1.78],
[0.68000000000000005, 0.72999999999999998, 0.85999999999999999, 1.5900000000000001],
[0.91000000000000003, 0.68000000000000005, 0.84999999999999998, 1.3600000000000001],
[0.58999999999999997, 0.65000000000000002, 1.02, 1.5900000000000001],
[0.35999999999999999, 1.03, 1.28, 1.46],
[0.77000000000000002, 0.88, 0.94999999999999996, 1.3500000000000001],
[0.53000000000000003, 1.0700000000000001, 0.97999999999999998, 1.53],
[0.66000000000000003, 1.21, 0.95999999999999996, 1.3600000000000001]]
As a side note, I'm not sure why the results I get are not rounded to 2 decimal places properly.
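(On that side note, my guess is that the values are in fact rounded, and the long digits are just how Python displays the underlying binary floats when the list is printed, since e.g. 0.45 has no exact double representation. A quick way to check is to format each value explicitly at print time:)
# format each rounded value as a two-decimal string when printing
for row in cluster_mean:
    print(['%.2f' % value for value in row])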