
As part of preprocessing:

  1. I removed attributes that are highly correlated (> 0.8).

  2. Standardized the data (StandardScaler).

    ```python
    import hdbscan
    from umap import UMAP

    # To reduce to lower dimensions I used UMAP
    reducer = UMAP(n_neighbors=20, min_dist=0, spread=2,
                   n_components=3, metric='euclidean')
    df_umap = reducer.fit_transform(df_scaled1)

    # For clustering I used HDBSCAN
    clusterer = hdbscan.HDBSCAN(min_cluster_size=30, max_cluster_size=100,
                                prediction_data=True)
    clusterer.fit(df_umap)

    # Assign clusters back to the original dataset
    df['cluster'] = clusterer.labels_
    ```
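For context, the preprocessing in steps 1–2 above looks roughly like this. This is a minimal sketch on synthetic data: the 0.8 correlation threshold and the StandardScaler step are from my description above, but the toy DataFrame and column names are purely illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real dataset (6 numeric columns, like mine).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 6)), columns=list("abcdef"))
df["b"] = df["a"] * 0.95 + rng.normal(scale=0.1, size=1000)  # a highly correlated pair

# Step 1: drop one column from every pair with |correlation| > 0.8.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.8).any()]
df_reduced = df.drop(columns=to_drop)

# Step 2: standardize (zero mean, unit variance per column).
df_scaled1 = StandardScaler().fit_transform(df_reduced)
print(to_drop, df_scaled1.shape)
```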

Data shape: (130351, 6)

| Column a | Column b | Column c | Column d | Column e | Column f |
|----------|----------|----------|----------|----------|----------|
| 6.000194 | 7.0 | 1059216 | 353069.000000 | 26.863543 | 15.891751 |
| 3.001162 | 3.5 | 1303727 | 396995.666667 | 32.508957 | 11.215764 |
| 6.000019 | 7.0 | 25887 | 3379.000000 | 18.004558 | 10.993119 |
| 6.000208 | 7.0 | 201138 | 59076.666667 | 41.140104 | 10.972880 |
| 6.000079 | 7.0 | 59600 | 4509.666667 | 37.469000 | 9.667119 |

df.describe():


Results:

1. While some clusters contain very similar data points (e.g. cluster 1555), many others have extreme data points grouped into a single cluster (e.g. cluster 5423).

  2. Also, cluster id '-1' (noise) has 36221 data points associated with it.

My questions:

  1. Am I using the correct approach for the data I have and the result I am trying to achieve?
  2. Is UMAP the correct choice for dimension reduction?
  3. Is HDBSCAN the right choice for this clustering problem? (I chose HDBSCAN because it doesn't need the number of clusters as user input, and the minimum and maximum number of data points per cluster can be set beforehand.)
  4. How can I tune the clustering model to achieve better cluster quality? (I am assuming that with better cluster quality, the points in cluster '-1' will also get clustered.)
  5. Is there any method to assess cluster quality?
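To frame question 5: the kind of check I have in mind is something like a silhouette score computed on the non-noise points only. A sketch on synthetic blobs (DBSCAN is used here purely because it labels noise as -1 the same way HDBSCAN does; the data and parameters are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs.
X, _ = make_blobs(n_samples=600, centers=[[0, 0], [5, 5], [-5, 5]],
                  cluster_std=0.5, random_state=42)
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(X)

# Score only the points actually assigned to a cluster (exclude label -1).
mask = labels != -1
score = silhouette_score(X[mask], labels[mask])
print(f"clusters: {len(set(labels) - {-1})}, "
      f"noise: {(~mask).sum()}, silhouette: {score:.3f}")
```

But I'm not sure whether excluding the noise points like this gives an honest picture of the clustering quality, hence the question.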
