As part of preprocessing I have:

- removed attributes that are highly correlated (> 0.8),
- standardized the data (`StandardScaler`).
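Roughly, that preprocessing looks like this (a minimal sketch; `df` is the raw DataFrame, and the exact column handling is simplified):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Drop one attribute from every pair with absolute correlation > 0.8
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df1 = df.drop(columns=to_drop)

# Standardize to zero mean / unit variance
df_scaled1 = StandardScaler().fit_transform(df1)
```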
For dimensionality reduction and clustering:

```python
import hdbscan
from umap import UMAP

# To reduce the scaled data to lower dimensions I used UMAP (3 components)
reducer = UMAP(n_neighbors=20, min_dist=0.0, spread=2,
               n_components=3, metric='euclidean')
df_umap = reducer.fit_transform(df_scaled1)

# For clustering I used HDBSCAN on the UMAP embedding
clusterer = hdbscan.HDBSCAN(min_cluster_size=30, max_cluster_size=100,
                            prediction_data=True)
clusterer.fit(df_umap)

# Assign cluster labels back to the original dataset
df['cluster'] = clusterer.labels_
```
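I set `prediction_data=True` so that new points can later be labeled without refitting, along these lines (a sketch; `df_scaled_new` is a hypothetical batch of new, already-scaled rows):

```python
# Project new (already scaled) rows into the fitted UMAP space,
# then soft-assign them to the existing HDBSCAN clusters.
new_embedding = reducer.transform(df_scaled_new)  # df_scaled_new is hypothetical
labels, strengths = hdbscan.approximate_predict(clusterer, new_embedding)
```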
Data shape: (130351, 6). Sample rows:
| Column a | Column b | Column c | Column d | Column e | Column f |
|---|---|---|---|---|---|
| 6.000194 | 7.0 | 1059216 | 353069.000000 | 26.863543 | 15.891751 |
| 3.001162 | 3.5 | 1303727 | 396995.666667 | 32.508957 | 11.215764 |
| 6.000019 | 7.0 | 25887 | 3379.000000 | 18.004558 | 10.993119 |
| 6.000208 | 7.0 | 201138 | 59076.666667 | 41.140104 | 10.972880 |
| 6.000079 | 7.0 | 59600 | 4509.666667 | 37.469000 | 9.667119 |
`df.describe()` output: not shown.
Results:
1. While some clusters contain very similar data points (e.g., cluster 1555), many of them group extreme, dissimilar points into a single cluster (e.g., cluster 5423).
2. Cluster id '-1' (HDBSCAN's noise label) has 36,221 data points associated with it; the counts come from the snippet below.
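A minimal sketch of how I count points per cluster:

```python
import pandas as pd

# Points per cluster; label -1 is HDBSCAN's noise bucket
cluster_sizes = pd.Series(clusterer.labels_).value_counts()
print(cluster_sizes.get(-1, 0))  # -> 36221 noise points
print(cluster_sizes.head(10))    # ten largest clusters
```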
My questions:
- Am I using the correct approach for the data I have and the result I am trying to achieve?
- Is UMAP the correct choice for dimension reduction?
- Is HDBSCAN the right choice for this clustering problem? (I chose HDBSCAN because it doesn't require the number of clusters as user input, and the minimum and maximum number of data points associated with a cluster can be set beforehand.)
- How do I tune the clustering model to achieve better cluster quality? (I am assuming that with better cluster quality, the points currently in cluster '-1' will also get clustered.)
- Is there any method to assess cluster quality?