I have a dataset and I want to use AgglomerativeClustering to find clusters.
I tried it on a small sample array, but I can't figure out how to set distance_threshold. I want to use it because I don't know the number of clusters in advance for this kind of data.
The sample code is below.
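from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import FunctionTransformer
corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower', 'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)
# Convert the sparse TF-IDF matrix to a dense ndarray; the lambda must use its own argument x, not the outer X
transformer = FunctionTransformer(lambda x: x.toarray(), accept_sparse=True)
X_dense = transformer.transform(X)
# n_clusters must be None when distance_threshold is set
AG = AgglomerativeClustering(n_clusters=None, distance_threshold=2, linkage='ward')
y_km = AG.fit_predict(X_dense)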
My question is:
- With distance_threshold=2, all records end up in one cluster. With 1, I get six clusters. With 1.5, I get two clusters, which is correct for this example. I created this sample data so that I could experiment and check the results, but how should the threshold be chosen in production code?
Is there a better way to select distance_threshold?
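For reference, the behaviour described above can be reproduced with a short sweep over the three thresholds. This is only a minimal sketch: it assumes scikit-learn 0.22 or later (for the distances_ attribute), and count_clusters is a helper name introduced here for illustration, not part of the library. It also prints the linkage distance of each merge, which may help show which gaps the thresholds fall into.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['Rose is a flower.', 'Apple is a fruit', 'Lily is a flower',
          'Banana is a fruit', 'Jackfruit is a fruit', 'Mango is a fruit']

# Dense TF-IDF matrix (AgglomerativeClustering does not take sparse input here).
X_dense = TfidfVectorizer(stop_words='english').fit_transform(corpus).toarray()


def count_clusters(threshold):
    """Hypothetical helper: fit Ward clustering and return the cluster count."""
    model = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=threshold,
                                    linkage='ward')
    model.fit(X_dense)
    return model.n_clusters_


# Number of clusters found for each threshold tried in the question.
counts = {t: count_clusters(t) for t in (1.0, 1.5, 2.0)}
print(counts)

# With distance_threshold=0 the full tree is built, and distances_ holds
# the linkage distance at which each of the n_samples - 1 merges happened.
full_tree = AgglomerativeClustering(n_clusters=None,
                                    distance_threshold=0,
                                    linkage='ward').fit(X_dense)
merge_distances = sorted(full_tree.distances_)
print(merge_distances)
```

Any threshold that lies between two consecutive merge distances gives the same number of clusters, so the printed list shows the range of values that behave like 1.5 on this sample.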