
The Scenario:

I'm performing clustering on the MovieLens dataset, which I have in two formats:

OLD FORMAT:

uid iid rat
941 1   5
941 7   4
941 15  4
941 117 5
941 124 5
941 147 4
941 181 5
941 222 2
941 257 4
941 258 4
941 273 3
941 294 4

NEW FORMAT:

uid 1               2               3               4
1   5               3               4               3
2   4               3.6185548023    3.646073985     3.9238342172
3   2.8978348799    2.6692556753    2.7693015618    2.8973463681
4   4.3320762062    4.3407749532    4.3111995162    4.3411425423
940 3.7996234581    3.4979386925    3.5707888503    2
941 5               NaN             NaN             NaN
942 4.5762594612    4.2752554573    4.2522440019    4.3761477591
943 3.8252406362    5               3.3748860659    3.8487417604

I need to cluster this data using KMeans, DBSCAN and HDBSCAN. With KMeans I'm able to set the number of clusters and get them.

The Problem:

The problem occurs only with DBSCAN & HDBSCAN: I'm unable to get a sufficient number of clusters (I do know the cluster count cannot be set manually).

Techniques Tried:

  • Tried this with the IRIS dataset, where I left out Species: that column is a string and, besides, is the value to be predicted. Everything works fine with that dataset (Snippet 1).
  • Tried the MovieLens 100K dataset in OLD FORMAT, both with and without uid, on the analogy that uid plays the same role as Species (Snippet 2).
  • Tried the same with NEW FORMAT (with and without uid), yet the results came out in the same style.

Snippet 1:

print "\n\n FOR IRIS DATA-SET:"
from sklearn.datasets import load_iris

iris = load_iris()
dbscan = DBSCAN()

d = pd.DataFrame(iris.data)
dbscan.fit(d)
print "Clusters", set(dbscan.labels_)

Snippet 1 (Output):

FOR IRIS DATA-SET:
Clusters set([0, 1, -1])
Out[30]: 
array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,
        1,  1,  1,  1,  1,  1, -1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,
       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1,  1,  1,
        1,  1,  1, -1, -1,  1, -1, -1,  1,  1,  1,  1,  1,  1,  1, -1, -1,
        1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1,  1, -1, -1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1])

Snippet 2:

import pandas as pd
from sklearn.cluster import DBSCAN

data_set = pd.DataFrame()  # placeholder, replaced below

ch = int(input("Extended Cluster Methods for:\n1. Main Matrix IBCF \n2. Main Matrix UBCF\nCh:"))
if ch == 1:  # '==', not 'is': 'is' tests identity and is unreliable for ints
    data_set = pd.read_csv("MainMatrix_IBCF.csv")
    data_set = data_set.iloc[:, 1:]  # drop the uid column
    data_set = data_set.dropna()     # drop rows containing NaN
elif ch == 2:
    data_set = pd.read_csv("MainMatrix_UBCF.csv")
    data_set = data_set.iloc[:, 1:]
    data_set = data_set.dropna()
else:
    print "Enter Proper choice!"

# data_set.info() prints its report and returns None, hence the stray "None" in the output
print "Starting with DBSCAN for Clustering on\n", data_set.info()

db_cluster = DBSCAN()
db_cluster.fit(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)

Snippet 2 (Output):

Extended Cluster Methods for:
1. Main Matrix IBCF 
2. Main Matrix UBCF
Ch:>? 1
Starting with DBSCAN for Clustering on
<class 'pandas.core.frame.DataFrame'>
Int64Index: 942 entries, 0 to 942
Columns: 1682 entries, 1 to 1682
dtypes: float64(1682)
memory usage: 12.1 MB
None
Clusters assigned are: set([-1])

As seen, it returns only one label, -1 (i.e. everything is noise). I'd like to hear what I am doing wrong.

– T3J45

4 Answers


As pointed out by @faraway and @Anony-Mousse, the solution is more about the mathematics of the dataset than about programming.

I could finally figure out the clusters. Here's how:

import numpy as np
from sklearn.cluster import DBSCAN

db_cluster = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2)
arr = db_cluster.fit_predict(data_set)
print "Clusters assigned are:", set(db_cluster.labels_)

# Count how many points landed in each cluster (-1 is noise)
uni, counts = np.unique(arr, return_counts=True)
d = dict(zip(uni, counts))
print d

The epsilon and outlier concepts became much clearer through this SO question: How can I choose eps and minPts (two parameters for DBSCAN algorithm) for efficient results?
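To make that concrete, here is a minimal sketch of the k-distance heuristic from that question, assuming the same data_set as above: plot each point's sorted distance to its k-th nearest neighbor and read a candidate eps off the "elbow" of the curve.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 4  # rule of thumb: k = minPts
# k + 1 neighbors, because each point is its own nearest neighbor at distance 0
nn = NearestNeighbors(n_neighbors=k + 1).fit(data_set)
distances, _ = nn.kneighbors(data_set)

plt.plot(np.sort(distances[:, -1]))  # sorted distance to each point's k-th neighbor
plt.xlabel("Points sorted by k-distance")
plt.ylabel("k-distance")
plt.show()  # the height of the 'elbow' is a candidate eps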

– T3J45
    With min_samples=2, you are not really doing DBSCAN, but single-linkage clustering. For real DBSCAN, choose larger minimum sizes (otherwise, everything is dense). – Erich Schubert Jan 17 '18 at 13:40
  • I tried increasing it, but it returns more outliers. Any solution to that? @Erich – T3J45 Jan 17 '18 at 14:27
  • So, is there a standard that says not to set min_samples to 2? Is there any equation that can retain min_samples w.r.t. the dataset? – T3J45 Jan 18 '18 at 11:41
  • Well, as said before, with min_samples<=2, you are getting a **single-linkage clustering**, which long predates DBSCAN. If you want **density** based clustering, you need to use enough samples to get density. Define "retain" for the second part. – Erich Schubert Jan 18 '18 at 13:13
  • By "retain" I mean: determining the number based on the dataset one is handling. – T3J45 Jan 18 '18 at 13:15
  • "Density = points / radius" has a fairly stable meaning in many applications, if enough points are considered. This depends on how well your distance function retains its meaning. – Erich Schubert Jan 18 '18 at 16:12

You need to choose appropriate parameters. With too small an epsilon, everything becomes noise. sklearn shouldn't have a default value for this parameter; it needs to be chosen for each data set differently.

You also need to preprocess your data.

It's trivial to get "clusters" with kmeans that are meaningless...

Don't just call random functions. You need to understand what you are doing, or you are just wasting your time.
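As an illustration of both points, a minimal sketch (the eps=2.0 and min_samples=4 below are placeholders, not recommendations): standardize the features so Euclidean distances are comparable, then pass an explicitly chosen eps instead of relying on the default.

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Standardize so every feature contributes comparably to Euclidean distance
X = StandardScaler().fit_transform(data_set)

labels = DBSCAN(eps=2.0, min_samples=4).fit_predict(X)  # placeholder parameters, not recommendations
print("Clusters assigned are: %s" % set(labels))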

– Has QUIT--Anony-Mousse
  • Great Advice, but I can't really put out my actual objective and code here. Just understand that I need these 2 clustering methods to be done. If you can point out preprocessing required and parameters to use, that would be of real use to me. – T3J45 Jan 02 '18 at 03:35
  • Read the DBSCAN paper. The parameters are documented there. Preprocessing is similar to what is needed to make kmeans return *meaningful* results if you use Euclidean distance (but in contrast to kmeans, you can use other distances that are more relevant for your mystery objective). – Has QUIT--Anony-Mousse Jan 02 '18 at 09:10

Firstly, you need to preprocess your data, removing any useless attributes such as IDs, and any incomplete instances (in case your chosen distance measure can't handle missing values).

It's good to understand that these algorithms come from two different paradigms: centroid-based (KMeans) and density-based (DBSCAN & HDBSCAN*). While centroid-based algorithms usually take the number of clusters as an input parameter, density-based algorithms need the number of neighbors (minPts) and the radius of the neighborhood (eps).

Normally in the literature the number of neighbors (minPts) is set to 4, and the radius (eps) is found by analyzing different values. You may find HDBSCAN* easier to use, as you only need to specify the number of neighbors (minPts).
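For example, with the hdbscan package the call reduces to a single required parameter (a minimal sketch; min_cluster_size=15 is just an illustrative value):

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=15)  # illustrative value, tune for your data
labels = clusterer.fit_predict(data_set)
print("Clusters assigned are: %s" % set(labels))  # -1 marks noise, as in DBSCAN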

If, after trying different configurations, you are still getting useless clusterings, maybe your data has no clusters at all and the KMeans output is meaningless.

– faraway

Have you tried seeing how the clusters look in 2D space, using PCA for example? If the whole dataset is dense and actually forms a single group, then you might well get a single cluster.

Also try changing other parameters, like min_samples, algorithm and metric. The possible values of algorithm and metric can be checked in sklearn.neighbors.VALID_METRICS.
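A sketch of that check, assuming the data_set from the question: project to two principal components and color the points by their DBSCAN label.

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

xy = PCA(n_components=2).fit_transform(data_set)      # project to 2D for plotting
labels = DBSCAN(min_samples=5).fit_predict(data_set)  # default eps; tune as discussed above

plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()  # a single dense blob would explain getting one cluster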

– ram4189