
I have a dataset of 38 apartments and their electricity consumption in the morning, afternoon and evening. I am trying to cluster this dataset using the k-means implementation from scikit-learn, and am getting some interesting results.

[Image: first clustering results]

This is all very well, and with 4 clusters I obviously get 4 labels assigned to the apartments: 0, 1, 2 and 3. Using the random_state parameter of the KMeans class, I can fix the seed with which the centroids are randomly initialized, so the same apartments consistently get the same labels.

However, since this specific case concerns energy consumption, the clusters can be ranked measurably from the lowest to the highest consumers. I would therefore like to assign label 0 to the apartments with the lowest consumption level, label 1 to the apartments that consume a bit more, and so on.

As of now, my labels are [2 1 3 0], or ["black", "green", "blue", "red"]; I would like them to be [0 1 2 3] or ["red", "green", "black", "blue"]. How should I proceed to do so, while still keeping the centroid initialization random (with fixed seed)?

Thank you very much for the help!

Tonechas
Sergio

2 Answers


Transforming the labels through a lookup table is a straightforward way to achieve what you want.

To begin with I generate some mock data:

import numpy as np

np.random.seed(1000)

# One row per apartment, one column per period of the day
n = 38
X_morning = np.random.uniform(low=.02, high=.18, size=n)
X_afternoon = np.random.uniform(low=.05, high=.20, size=n)
X_night = np.random.uniform(low=.025, high=.175, size=n)
X = np.vstack([X_morning, X_afternoon, X_night]).T

Then I perform clustering on the data:

from sklearn.cluster import KMeans
k = 4
kmeans = KMeans(n_clusters=k, random_state=0).fit(X)

And finally I use NumPy's argsort to create a lookup table like this:

idx = np.argsort(kmeans.cluster_centers_.sum(axis=1))
lut = np.zeros_like(idx)
lut[idx] = np.arange(k)

Sample run:

In [70]: kmeans.cluster_centers_.sum(axis=1)
Out[70]: array([ 0.3214523 ,  0.40877735,  0.26911353,  0.25234873])

In [71]: idx
Out[71]: array([3, 2, 0, 1], dtype=int64)

In [72]: lut
Out[72]: array([2, 3, 1, 0], dtype=int64)

In [73]: kmeans.labels_
Out[73]: array([1, 3, 1, ..., 0, 1, 0])

In [74]: lut[kmeans.labels_]
Out[74]: array([3, 0, 3, ..., 2, 3, 2], dtype=int64)

idx lists the original cluster labels ordered from the lowest to the highest consumption level. The apartments for which lut[kmeans.labels_] is 0 belong to the cluster with the lowest consumption level, and those for which it is 3 belong to the cluster with the highest.
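
The same lookup table carries over to new data: kmeans.predict returns the original labels, and indexing lut with them yields the ordered ones. The following is a minimal sketch under the same assumption as above (clusters are ranked by the sum of their centroid coordinates). The helper name ordered_lut, the mock data and the two query points are illustrative only and not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

def ordered_lut(centers):
    """Map each original label to its rank by total centroid consumption."""
    idx = np.argsort(centers.sum(axis=1))   # original labels, lowest total first
    lut = np.empty_like(idx)
    lut[idx] = np.arange(len(idx))          # invert the permutation
    return lut

rng = np.random.default_rng(0)
X = rng.uniform(0.02, 0.2, size=(38, 3))    # mock data: 38 apartments, 3 periods

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
lut = ordered_lut(kmeans.cluster_centers_)

# Ordered labels for the training data
ordered_labels = lut[kmeans.labels_]

# Ordered labels for unseen apartments (a very low and a very high consumer)
new_points = np.array([[0.02, 0.02, 0.02], [0.2, 0.2, 0.2]])
new_labels = lut[kmeans.predict(new_points)]
print(ordered_labels)
print(new_labels)
```

Because lut depends only on the fitted centroids, it can be computed once after fitting and reused for every later call to predict.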

Tonechas
    I was looking for something built into the scikit-learn package, wondering whether it was already implemented in the clustering methods. Since it is not, your solution worked perfectly - thank you. – Sergio Jul 05 '17 at 07:37

Sorting the centroids by their vector magnitude may be a better option, since the same model can then be used to predict labels for other data. Here is my implementation, taken from my repo:

import numpy as np
from sklearn.cluster import KMeans

def sorted_cluster(x, model=None):
    """Fit the model, then order centroids and labels by centroid magnitude."""
    if model is None:
        model = KMeans()
    model = sorted_cluster_centers_(model, x)
    model = sorted_labels_(model, x)
    return model

def sorted_cluster_centers_(model, x):
    """Fit the model and reorder its centroids by increasing Euclidean norm."""
    model.fit(x)
    magnitude = np.linalg.norm(model.cluster_centers_, axis=1)
    idx_argsort = np.argsort(magnitude)
    model.cluster_centers_ = model.cluster_centers_[idx_argsort]
    return model

def sorted_labels_(sorted_model, x):
    """Recompute the labels so that they match the reordered centroids."""
    sorted_model.labels_ = sorted_model.predict(x)
    return sorted_model

Example:

import numpy as np
arr = np.vstack([
    100 + np.random.random((2,3)),
    np.random.random((2,3)),
    5 + np.random.random((3,3)),
    10 + np.random.random((2,3))
])
print('Data:')
print(arr)

cluster = KMeans(n_clusters=4)

print('\n Without sort:')
cluster.fit(arr)
print(cluster.cluster_centers_)
print(cluster.labels_)
print(cluster.predict([[5,5,5],[1,1,1]]))

print('\n With sort:')
cluster = sorted_cluster(arr, cluster)
print(cluster.cluster_centers_)
print(cluster.labels_)
print(cluster.predict([[5,5,5],[1,1,1]]))

Output:

Data:
[[100.52656263 100.57376566 100.63087757]
 [100.70144046 100.94095196 100.57095386]
 [  0.21284187   0.75623797   0.77349013]
 [  0.28241023   0.89878796   0.27965047]
 [  5.14328748   5.37025887   5.26064209]
 [  5.21030632   5.09597417   5.29507699]
 [  5.81531591   5.11629056   5.78542656]
 [ 10.25686526  10.64181304  10.45651994]
 [ 10.14153211  10.28765705  10.20653228]]

 Without sort:
[[ 10.19919868  10.46473505  10.33152611]
 [100.61400155 100.75735881 100.60091572]
 [  0.24762605   0.82751296   0.5265703 ]
 [  5.38963657   5.19417453   5.44704855]]
[1 1 2 2 3 3 3 0 0]
[3 2]

 With sort:
[[  0.24762605   0.82751296   0.5265703 ]
 [  5.38963657   5.19417453   5.44704855]
 [ 10.19919868  10.46473505  10.33152611]
 [100.61400155 100.75735881 100.60091572]]
[3 3 0 0 1 1 1 2 2]
[1 0]
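
Reordering cluster_centers_ works because predict assigns each sample the index of its nearest center, so the labels follow the new row order. The sketch below is a quick sanity check of that behaviour on mock data similar to the example above. It assumes that overwriting cluster_centers_ on a fitted KMeans is honoured by predict, which is how the code in this answer relies on it but is not a documented guarantee.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Four well-separated groups, deliberately not in magnitude order
X = np.vstack([
    100 + rng.random((2, 3)),
    rng.random((2, 3)),
    5 + rng.random((3, 3)),
    10 + rng.random((2, 3)),
])

model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Reorder the centroids by Euclidean norm, then recompute the labels
order = np.argsort(np.linalg.norm(model.cluster_centers_, axis=1))
model.cluster_centers_ = model.cluster_centers_[order]
model.labels_ = model.predict(X)

# Norms of the reordered centroids, smallest first
norms = np.linalg.norm(model.cluster_centers_, axis=1)
print(norms)
print(model.labels_)
```

Note that the ranking criterion here is the Euclidean norm of each centroid rather than the sum of its coordinates, so the two answers can order clusters differently when the centroids are close in magnitude.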
Muhammad Yasirroni