python unsupervised learning dbscan scikit application example

Question

I have following list that I would like to perform unsupervised learning on and use the knowledge to predict a value for each item in the test list

#Format [real_runtime, processors, requested_time, score, more_to_be_added]
#some entries from the list

Training dataset

Xsrc = [['354', '2048', '3600', '53.0521472395'], 
      ['605', '2048', '600', '54.8768871369'], 
      ['128', '2048', '600', '51.0'], 
      ['136', '2048', '900', '51.0000000563'], 
      ['19218', '480', '21600', '51.0'], 
      ['15884', '2048', '18000', '51.0'], 
      ['118', '2048', '1500', '51.0'], 
      ['103', '2048', '2100', '51.0000002839'], 
      ['18542', '480', '21600', '51.0000000001'], 
      ['13272', '2048', '18000', '51.0000000001']]

Test data set

Using the clusters I would like to predict the real_runtime of a new list: Xtest= [['-1', '2048', '1500', '51.0000000161'], ['-1', '2048', '10800', '51.0000000002'], ['-1', '512', '21600', '-1'], ['-1', '512', '2700', '51.0000000004'], ['-1, '1024', '21600', '51.1042617556']]

Code: Formatting the list and Making clusters using scikit in python and plotting the clusters

from sklearn.feature_selection import VarianceThreshold
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

##Training dataset
Xsrc = [['354', '2048', '3600', '53.0521472395'], 
      ['605', '2048', '600', '54.8768871369'], 
      ['128', '2048', '600', '51.0'], 
      ['136', '2048', '900', '51.0000000563'], 
      ['19218', '480', '21600', '51.0'], 
      ['15884', '2048', '18000', '51.0'], 
      ['118', '2048', '1500', '51.0'], 
      ['103', '2048', '2100', '51.0000002839'], 
      ['18542', '480', '21600', '51.0000000001'], 
      ['13272', '2048', '18000', '51.0000000001']]

print "Xsrc:", Xsrc

##Test data set
Xtest= [['1224', '2048', '1500', '51.0000000161'],
       ['7867', '2048', '10800', '51.0000000002'],
       ['21594', '512', '21600', '-1'], 
       ['1760', '512', '2700', '51.0000000004'],
       ['115', '1024', '21600', '51.1042617556']]


##Clustering 
X = StandardScaler().fit_transform(Xsrc)
db = DBSCAN(min_samples=2).fit(X) #no clustering parameter, such as default eps
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
clusters = [X[labels == i] for i in xrange(n_clusters_)]

print('Estimated number of clusters: %d' % n_clusters_)
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))


##Plotting the dataset
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=20)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=10)


plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Any ideas how I can use the clusters to predict the value?

Why are you clustering, rather than doing a simple multivariate regression? — Prune, Nov 02 '16 at 21:49
As Prune already suggested - there is no point in doing clustering in the first place. The only reason to do so would be because you have some idea how to use it later, and **not doing** that **did not work**. In general - this is not how you should approach the problem, start with simplest solutions and if they fail - look for some more complex ones. — lejlot, Nov 02 '16 at 22:35
The initial idea was to learn from hundreds of records and use that knowledge to predict one attribute of the next one. They reason I considered clustering is because the new record(predict) will have similarities to some of the records learned(processed) so far not all of them. — SummonersRift, Nov 02 '16 at 23:44

score 1 · Accepted Answer · answered Nov 03 '16 at 06:38

Clustering is not prediction

There is little use in "predicting" a cluster label, because it was just assigned "randomly" by the clustering algorithm.

Even worse: most algorithms cannot incorporate new data.

You really should use clustering to explore your data, and learn what is there and what not. Do not rely on the clustering being 'good'.

Sometimes, people have success with quantizing the data set into k centers, and then using only this "compressed" data set for classification/prediction (usually based on the nearest neighbor only). I have also seen the idea around of training onemregression per cluster for prediction, and choosing the regressor to apply using nearest neighbors (i.e. if the data fits a cluster well, use the clusters regression model). But I don't remember any major success stories...

python unsupervised learning dbscan scikit application example

Training dataset

Test data set

Code: Formatting the list and Making clusters using scikit in python and plotting the clusters

1 Answers1

Clustering is not prediction