
Overview

This code uses a cluster function that operates on one-dimensional arrays and finds the clusters within an array, where each cluster is defined by margins to the left and right of every point. I would like to replicate this functionality with DBSCAN.

Imports:

import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN 

Create a test df:

df2 = pd.DataFrame(
       {'AAA' : [80], 
        'BBB' : [85],
        'CCC' : [100],
        'DDD' : [98],
        'EEE' : [103],
        'FFF' : [105],
        'GGG' : [109],
        'HHH' : [200]});
df2

Original code using NumPy/Pandas, for reference

This is what I am trying to replicate with DBSCAN.

Set the threshold for clustering

thresh = 5

Delta clustering function: This finds the clusters within an array defined by margins to the left and right of every point.

def delta_cluster(a, dleft, dright):
    # Sort the array, remembering how to undo the sort later
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))

    # A point starts a new cluster if no earlier point lies within dleft of it
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng

    # ...and if no point before it reaches within dright of it
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng

    # Cumulative count of cluster starts gives a label per point;
    # indexing with y restores the original order
    return (starts & ends).cumsum()[y]
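As a quick sanity check, running delta_cluster directly on the flattened test values with symmetric margins equal to thresh gives one label per point (labels start at 1; the singleton 200 gets its own label):

```python
import numpy as np

# delta_cluster as defined above
def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    return (starts & ends).cumsum()[y]

# The flattened values of the test frame, in column order
vals = np.array([80, 85, 100, 98, 103, 105, 109, 200])
print(delta_cluster(vals, 5, 5))
# [1 1 2 2 2 2 2 3] : {80, 85}, {100, 98, 103, 105, 109}, {200}
```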

Apply the function on our test df

def applyDeltaCluster(df):
    clusters = pd.DataFrame(
        np.apply_along_axis(delta_cluster, 1, df.values, 10, 10),
        df.index, df.columns).stack()

    lvl0 = clusters.index.get_level_values(0)
    size = clusters.groupby([lvl0, clusters]).transform('size')

    val = df.stack().to_frame('value').set_index(clusters, append=True).value

    return val.mask(size.values == 1).dropna().unstack(1).reset_index(drop=True)

applyDeltaCluster(df2)

Output, with a cluster per row. This is also the desired output for the DBSCAN version:

    AAA     BBB     CCC     DDD     EEE     FFF     GGG
0   80.0    85.0    NaN     NaN     NaN     NaN     NaN
1   NaN     NaN     100.0   98.0    103.0   105.0   109.0

DBSCAN

What have I tried?

This is the DBSCAN code I have so far. If I reshape df2 into a one-dimensional columnar array, I can use the following function:

def DBSCAN_cluster(a, thresh):
    # Compute DBSCAN with eps equal to the clustering threshold
    db = DBSCAN(eps=thresh, min_samples=2).fit(a)
    labels = db.labels_
    # Number of clusters in labels, ignoring noise if present
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)
    print('Estimated number of clusters: %d' % n_clusters_)
    print('Estimated number of noise points: %d' % n_noise_)
    return labels

DBSCAN_cluster(df2.values.reshape(-1, 1), thresh)

This returns 2 clusters as expected.

Estimated number of clusters: 2
Estimated number of noise points: 1
array([ 0,  0,  1,  1,  1,  1,  1, -1])

I'm unsure how to progress from here and achieve the desired output which is a Pandas DataFrame, with a row per cluster, as per the example above.


1 Answer

I'm not sure what you want to do with the -1 (the noise label), but assuming you get your labels back like this:

def DBSCAN_cluster(a, eps):
    db = DBSCAN(eps=eps, min_samples=2).fit(a)
    return db.labels_

lbl = DBSCAN_cluster(df2.T, 5)
idx = np.unique(lbl)

You can use pd.concat to fill in the missing values:

res = pd.concat([df2.iloc[:,lbl==i] for i in idx],keys=idx)

       HHH   AAA   BBB    CCC   DDD    EEE    FFF    GGG
-1 0  200.0   NaN   NaN    NaN   NaN    NaN    NaN    NaN
 0 0    NaN  80.0  85.0    NaN   NaN    NaN    NaN    NaN
 1 0    NaN   NaN   NaN  100.0  98.0  103.0  105.0  109.0

If you do not want the -1 (noise) row, and want the columns in the same order as the data frame, you can just do:

res[1:][df2.columns]
      AAA   BBB    CCC   DDD    EEE    FFF    GGG  HHH
0 0  80.0  85.0    NaN   NaN    NaN    NaN    NaN  NaN
1 0   NaN   NaN  100.0  98.0  103.0  105.0  109.0  NaN  
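For reuse, the answer's steps can be wrapped into a single function mirroring applyDeltaCluster; the name applyDBSCANCluster is illustrative, not part of any library:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def applyDBSCANCluster(df, eps):
    # Cluster the column values; df.T gives one sample per column
    lbl = DBSCAN(eps=eps, min_samples=2).fit(df.T).labels_
    idx = np.unique(lbl)
    # One row per cluster label, NaN where a column is not in that cluster
    res = pd.concat([df.iloc[:, lbl == i] for i in idx], keys=idx)
    # Drop the noise rows (label -1) and restore the original column order
    return res.loc[idx[idx >= 0]][df.columns].reset_index(drop=True)

df2 = pd.DataFrame(
    {'AAA': [80], 'BBB': [85], 'CCC': [100], 'DDD': [98],
     'EEE': [103], 'FFF': [105], 'GGG': [109], 'HHH': [200]})
print(applyDBSCANCluster(df2, 5))
```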

  • Thanks. How would you approach getting the optimal value for epsilon on a similarly shaped but larger dataset? I'm looking for the maximum number of clusters with the lowest epsilon. – nipy Mar 14 '21 at 00:57
  • I don't think there's a method that will work for sure. One commonly used approach is to calculate the distance from each point to its k-th nearest neighbour and look for the elbow in the plot. See something like https://stackoverflow.com/questions/43160240/how-to-plot-a-k-distance-graph-in-python – StupidWolf Mar 14 '21 at 15:04
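Following up on that comment, here is a minimal sketch of the k-distance approach using scikit-learn's NearestNeighbors, applied to the test data above (k is chosen to match min_samples; the elbow in the sorted distances suggests a value for eps):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

df2 = pd.DataFrame(
    {'AAA': [80], 'BBB': [85], 'CCC': [100], 'DDD': [98],
     'EEE': [103], 'FFF': [105], 'GGG': [109], 'HHH': [200]})

X = df2.values.reshape(-1, 1)
k = 2  # match the min_samples used for DBSCAN
nbrs = NearestNeighbors(n_neighbors=k).fit(X)
dists, _ = nbrs.kneighbors(X)  # dists[:, 0] is each point itself (distance 0)
kdist = np.sort(dists[:, -1])  # sorted distance to the k-th neighbour
print(kdist)
# [ 2.  2.  2.  2.  4.  5.  5. 91.] : sharp jump after 5, so eps = 5 is a reasonable pick
```

On a larger dataset you would plot kdist and read eps off the elbow rather than inspecting the values directly.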