
I'm trying to cluster hundreds of text documents so that each cluster represents a distinct topic. Instead of using topic modeling (which I know I could do too), I want to follow a two-step approach:

  1. Create document embeddings with Sentence-BERT (using SentenceTransformer)
  2. Feed the embeddings into a clustering algorithm

I know I could e.g. use k-means for step 2, but I prefer a soft clustering algorithm, as my documents sometimes belong to multiple topics. So I want to get a probability for each document of belonging to each cluster. My embeddings have 768 dimensions, and when implementing a soft clustering algorithm (Gaussian Mixture Models), I realized that the high dimensionality caused problems. So I was thinking about using a dimensionality reduction technique (e.g., PCA) and feeding the resulting components into the clustering algorithm, roughly as in the sketch below.
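Here's a minimal sketch of what I have in mind so far (the model name, number of components, and number of clusters are just placeholders, not tuned choices):

```python
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Stand-in for my real corpus of a few hundred documents
documents = [f"placeholder document {i}" for i in range(300)]

# Step 1: 768-dimensional Sentence-BERT embeddings
model = SentenceTransformer("all-mpnet-base-v2")  # placeholder model name
embeddings = model.encode(documents)              # shape: (n_docs, 768)

# Dimensionality reduction before clustering (PCA as one option)
reduced = PCA(n_components=50).fit_transform(embeddings)

# Step 2: soft clustering with a Gaussian Mixture Model
gmm = GaussianMixture(n_components=10, random_state=42).fit(reduced)
probs = gmm.predict_proba(reduced)  # per-document cluster membership probabilities
```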

However, I'm not very familiar with dimensionality reduction in such a high-dimensional space, and especially not in the context of NLP. Can anyone advise on a good approach / method here?

Thank you!

Selina

1 Answer


I think you should take a look at UMAP as an effective dimensionality reduction technique. Both PCA and UMAP are relatively quick and easy to use.

UMAP uses a user-specified distance function as its similarity measure and tries to preserve pairwise distances between points in the lower-dimensional space. This makes it a good fit for Sentence-BERT embeddings: the model is trained with a cosine-similarity objective, so you can run UMAP with the cosine metric.

https://umap-learn.readthedocs.io
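For example, something along these lines (a rough sketch, not a tuned setup; the component count, neighbourhood size, and number of clusters are just illustrative):

```python
import numpy as np
import umap
from sklearn.mixture import GaussianMixture

# Stand-in for the (n_docs, 768) Sentence-BERT embedding matrix from step 1
embeddings = np.random.rand(300, 768)

# Reduce with UMAP, using the cosine metric to match the embedding model's objective
reducer = umap.UMAP(
    n_components=10,
    metric="cosine",
    n_neighbors=15,
    min_dist=0.0,
    random_state=42,
)
reduced = reducer.fit_transform(embeddings)

# Soft clustering on the reduced space
gmm = GaussianMixture(n_components=10, random_state=42).fit(reduced)
probs = gmm.predict_proba(reduced)  # probability of each document belonging to each cluster
```

Setting `min_dist` low tends to pack similar points together, which usually helps when the reduced space is fed into a clustering algorithm rather than used for visualization.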

Tomasz