Overview
This code uses a cluster function that operates on one-dimensional arrays and finds the clusters within an array, where clusters are defined by margins to the left and right of every point. I would like to replicate this functionality with DBSCAN.
Imports:
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
Create a test df:
df2 = pd.DataFrame(
    {'AAA': [80],
     'BBB': [85],
     'CCC': [100],
     'DDD': [98],
     'EEE': [103],
     'FFF': [105],
     'GGG': [109],
     'HHH': [200]})
df2
Original code using NumPy/pandas for reference
This is what I am trying to replicate with DBSCAN.
Set the threshold for clustering
thresh = 5
Delta clustering function: This finds the clusters within an array defined by margins to the left and right of every point.
def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    return (starts & ends).cumsum()[y]
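As far as I can tell, when the left and right margins are equal, this amounts to sorting the values and starting a new cluster wherever the gap between neighbours exceeds the margin. The sketch below illustrates that reading only; `gap_cluster` and `sample` are hypothetical names that are not part of the original code, and it assumes a non-empty one-dimensional input.

```python
import numpy as np

def gap_cluster(values, margin):
    """Label 1-D points; a new cluster starts where the sorted gap exceeds margin.

    Hypothetical helper for illustration; assumes a non-empty 1-D input.
    """
    values = np.asarray(values, dtype=float)
    order = values.argsort()
    sorted_vals = values[order]
    # True at every sorted position that opens a new cluster.
    new_cluster = np.append(True, np.diff(sorted_vals) > margin)
    labels_sorted = new_cluster.cumsum()
    # Scatter the labels back into the original element order.
    labels = np.empty(len(values), dtype=int)
    labels[order] = labels_sorted
    return labels

sample = np.array([80, 85, 100, 98, 103, 105, 109, 200])
print(gap_cluster(sample, 10))
```

With a margin of 10, the values 80 and 85 share one label, 98 to 109 share another, and 200 stands alone.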
Apply the function on our test df
def applyDeltaCluster(df):
    # Margins of 10 on each side are hard-coded here; thresh is not used.
    clusters = pd.DataFrame(
        np.apply_along_axis(delta_cluster, 1, df.values, 10, 10),
        df.index, df.columns).stack()
    lvl0 = clusters.index.get_level_values(0)
    size = clusters.groupby([lvl0, clusters]).transform('size')
    val = df.stack().to_frame('value').set_index(clusters, append=True).value
    # Mask clusters of size 1, then pivot to one row per cluster.
    return val.mask(size.values == 1).dropna().unstack(1).reset_index(drop=True)

applyDeltaCluster(df2)
Output, with one cluster per row. This is also the desired output for the DBSCAN function.
    AAA   BBB    CCC   DDD    EEE    FFF    GGG
0  80.0  85.0    NaN   NaN    NaN    NaN    NaN
1   NaN   NaN  100.0  98.0  103.0  105.0  109.0
DBSCAN
What have I tried?
This is the DBSCAN code I have so far. If I take df2 and reshape its values into a single-column array of shape (n, 1), I can use the following function:
def DBSCAN_cluster(a, thresh):
    eps = thresh
    # Compute DBSCAN
    db = DBSCAN(eps=eps, min_samples=2).fit(a)
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_
    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)
    print('Estimated number of clusters: %d' % n_clusters_)
    print('Estimated number of noise points: %d' % n_noise_)
    return labels

DBSCAN_cluster(df2.values.reshape(-1, 1), thresh)
This returns 2 clusters as expected.
Estimated number of clusters: 2
Estimated number of noise points: 1
array([ 0, 0, 1, 1, 1, 1, 1, -1])
I'm unsure how to progress from here to the desired output, which is a pandas DataFrame with one row per cluster, as in the example above.
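To make the missing step concrete, here is a minimal sketch of the shape of the transformation I am after: take one row of values plus the label array shown above and produce one output row per cluster. `labels_to_frame` is a hypothetical helper name, the labels are hard-coded from the DBSCAN output in this post, and dropping the noise point (label -1) is my assumption about the right behaviour.

```python
import numpy as np
import pandas as pd

def labels_to_frame(row, labels):
    """Build one output row per cluster label; noise (-1) is dropped.

    Hypothetical helper: `row` is a pandas Series of values and `labels`
    holds one cluster label per element of `row`.
    """
    labels = np.asarray(labels)
    cluster_ids = sorted(set(labels.tolist()) - {-1})
    # Keep the members of each cluster and put NaN everywhere else.
    pieces = [row.where(labels == cid) for cid in cluster_ids]
    out = pd.DataFrame(pieces).reset_index(drop=True)
    # Drop columns that belong to no cluster (for example the noise point).
    return out.dropna(axis=1, how="all")

df2 = pd.DataFrame({'AAA': [80], 'BBB': [85], 'CCC': [100], 'DDD': [98],
                    'EEE': [103], 'FFF': [105], 'GGG': [109], 'HHH': [200]})
labels = [0, 0, 1, 1, 1, 1, 1, -1]  # copied from the DBSCAN output above
result = labels_to_frame(df2.iloc[0], labels)
print(result)
```

This handles a single row only; a DataFrame with several rows would presumably need the same step applied per row and the results concatenated.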