As part of preprocessing I have:

- removed attributes that are highly correlated (> 0.8),
- standardized the data (`StandardScaler`).
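Roughly, that preprocessing looks like this (a minimal sketch; `df` is the raw DataFrame, and the exact column handling is simplified):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Drop one attribute from every pair with absolute correlation > 0.8
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df1 = df.drop(columns=to_drop)

# Standardize to zero mean / unit variance
df_scaled1 = StandardScaler().fit_transform(df1)
```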
For dimensionality reduction and clustering:

```python
import hdbscan
from umap import UMAP

# To reduce the scaled data to lower dimensions I used UMAP (3 components)
reducer = UMAP(n_neighbors=20, min_dist=0.0, spread=2,
               n_components=3, metric='euclidean')
df_umap = reducer.fit_transform(df_scaled1)

# For clustering I used HDBSCAN on the UMAP embedding
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, max_cluster_size=100,
                            prediction_data=True)
clusterer.fit(df_umap)

# Assign cluster labels back to the original dataset
df['cluster'] = clusterer.labels_
```
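I set `prediction_data=True` so that new points can later be labeled without refitting, along these lines (a sketch; `df_scaled_new` is a hypothetical batch of new, already-scaled rows):

```python
# Project new (already scaled) rows into the fitted UMAP space,
# then soft-assign them to the existing HDBSCAN clusters.
new_embedding = reducer.transform(df_scaled_new)  # df_scaled_new is hypothetical
labels, strengths = hdbscan.approximate_predict(clusterer, new_embedding)
```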
Data shape: (130351, 6). Sample rows:
| Column a | Column b | Column c | Column d | Column e | Column f |
|---|---|---|---|---|---|
| 6.000194 | 7.0 | 1059216 | 353069.000000 | 26.863543 | 15.891751 |
| 3.001162 | 3.5 | 1303727 | 396995.666667 | 32.508957 | 11.215764 |
| 6.000019 | 7.0 | 25887 | 3379.000000 | 18.004558 | 10.993119 |
| 6.000208 | 7.0 | 201138 | 59076.666667 | 41.140104 | 10.972880 |
| 6.000079 | 7.0 | 59600 | 4509.666667 | 37.469000 | 9.667119 |
`df.describe()` output: not shown.
Results:
1. While some clusters contain very similar data points (e.g., cluster 1555), many of them group extreme, dissimilar points into a single cluster (e.g., cluster 5423).
2. Cluster id '-1' (HDBSCAN's noise label) has 36,221 data points associated with it; the counts come from the snippet below.
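A minimal sketch of how I count points per cluster:

```python
import pandas as pd

# Points per cluster; label -1 is HDBSCAN's noise bucket
cluster_sizes = pd.Series(clusterer.labels_).value_counts()
print(cluster_sizes.get(-1, 0))  # -> 36221 noise points
print(cluster_sizes.head(10))    # ten largest clusters
```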
My questions:
- Am I using the correct approach for the data I have and the result I am trying to achieve?
- Is UMAP the correct choice for dimension reduction?
- Is HDBSCAN the right choice for this clustering problem? (I chose HDBSCAN because it doesn't require the number of clusters as user input, and the minimum and maximum number of data points associated with a cluster can be set beforehand.)
- How do I tune the clustering model to achieve better cluster quality? (I am assuming that with better cluster quality, the points currently in cluster '-1' will also get clustered.)
- Is there any method to assess cluster quality?