I have a dataset and I want to use AgglomerativeClustering to find clusters.

I tried it on a small sample array, but I can't figure out how to set distance_threshold. I want to use this parameter because I don't know the number of clusters in advance for this kind of data.

The sample code is below.

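from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer

corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower', 'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
# Convert the sparse TF-IDF matrix to a dense array (the lambda must use its own argument x, not the outer X)
transformer = FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)
X_dense = transformer.transform(X)
AG = AgglomerativeClustering(n_clusters=None, distance_threshold=2, linkage='ward')
y_km = AG.fit_predict(X_dense)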

My question is:

  1. If I set distance_threshold to 2, all records end up in one cluster. If I set it to 1, I get six clusters. With 1.5 I get two clusters, which is correct for this example. I made up this sample data so that I could experiment and check correctness, but how should the value be chosen in production code?
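To make the behaviour described above easy to reproduce, here is a minimal sketch that refits the model for a few candidate thresholds and records the resulting number of clusters. The helper name `clusters_per_threshold` is made up for illustration; `n_clusters_` is the attribute scikit-learn sets on the fitted estimator.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower',
          'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']

# Dense TF-IDF matrix, same preprocessing as in the question
X_dense = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()


def clusters_per_threshold(X, thresholds):
    """Hypothetical helper: map each threshold to the number of clusters found."""
    counts = {}
    for t in thresholds:
        model = AgglomerativeClustering(
            n_clusters=None, distance_threshold=t, linkage='ward'
        )
        model.fit(X)
        counts[t] = int(model.n_clusters_)
    return counts


counts = clusters_per_threshold(X_dense, [1.0, 1.5, 2.0])
print(counts)
```

With ward linkage the merge distances never decrease, so the cluster count can only stay the same or drop as the threshold grows.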

Is there a better way of selecting distance_threshold?
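One idea I am considering, sketched here as an assumption and not as an established recipe: build the full merge tree once with distance_threshold=0, look at the merge distances in `distances_`, and place the threshold in the middle of the largest jump between consecutive merges. The helper name `suggest_threshold` is hypothetical, and the "largest gap" rule is only a heuristic that may not suit every dataset.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower',
          'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']
X_dense = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()


def suggest_threshold(X, linkage='ward'):
    """Hypothetical helper: midpoint of the largest gap between merge distances.

    Needs at least three samples so that there are two merges to compare.
    """
    # distance_threshold=0 with n_clusters=None builds the whole tree
    # and fills the distances_ attribute with one value per merge.
    tree = AgglomerativeClustering(
        n_clusters=None, distance_threshold=0, linkage=linkage
    ).fit(X)
    merge_distances = np.sort(tree.distances_)
    gaps = np.diff(merge_distances)
    i = int(np.argmax(gaps))
    return float((merge_distances[i] + merge_distances[i + 1]) / 2)


threshold = suggest_threshold(X_dense)
model = AgglomerativeClustering(
    n_clusters=None, distance_threshold=threshold, linkage='ward'
)
labels = model.fit_predict(X_dense)
print(threshold, model.n_clusters_, labels)
```

An alternative along the same lines would be to score several candidate thresholds with a cluster-quality metric such as `sklearn.metrics.silhouette_score` and keep the best one.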

A3006