
I am working on a recommendation algorithm, which right now boils down to finding the right clustering algorithm for the job.

Data

The data I'm working with is the MovieLens 100K dataset, from which I've extracted movie titles, genres, and tags and concatenated them into a single document per movie. This gives me about 10,000 documents. These have been vectorized with TF-IDF and then autoencoded down to 64-dim feature vectors (loss = 0.0014, down from 22.14, after 30 epochs). The autoencoder is able to reconstruct the data well.
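Roughly, the pipeline looks like this (a minimal sketch, not my exact code; the toy `documents` list, layer sizes, and training settings are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow import keras

# One concatenated title/genres/tags string per movie (~10,000 in total).
documents = [
    "Toy Story (1995) Adventure Animation Children pixar fun",
    "Heat (1995) Action Crime Thriller bank robbery",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(documents).toarray()  # shape: (n_movies, vocab_size)

# Small dense autoencoder with a 64-dim bottleneck.
input_dim = X.shape[1]
inputs = keras.Input(shape=(input_dim,))
h = keras.layers.Dense(512, activation="relu")(inputs)
z = keras.layers.Dense(64, activation="relu")(h)           # 64-dim features
h_dec = keras.layers.Dense(512, activation="relu")(z)
outputs = keras.layers.Dense(input_dim, activation="sigmoid")(h_dec)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, z)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=30, batch_size=256, shuffle=True)

Z = encoder.predict(X)  # the 64-dim feature vectors used for clustering
```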

Clustering

Currently, I am working with HDBSCAN, since it should be able to handle datasets with varying density, non-globular clusters, and arbitrary cluster shapes; on paper, it is the right algorithm for this job. The 2D t-SNE projection of the original 64-dimensional data shows what looks like a decently clusterable space, but I cannot get HDBSCAN to work properly. Setting min_cluster_size to 15-30 gives me this; any higher and it labels every point as noise, and anything lower gives me this. Alternatively, it lumps the large majority of points into one cluster, plus a few very small clusters, with the rest as noise, like this. It seems like HDBSCAN can't handle the data, yet the data looks clusterable to me.
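For concreteness, a sketch of the HDBSCAN knobs in play (I've mainly been varying min_cluster_size; the other parameters and the values shown are illustrative, not a working configuration):

```python
import hdbscan

# Z = the 64-dim embeddings from the autoencoder sketch above.
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=20,             # 15-30 gives a few clusters; higher -> all noise
    min_samples=5,                   # lower = less conservative about labelling noise
    cluster_selection_epsilon=0.05,  # merge clusters that split below this distance
    cluster_selection_method="leaf", # "leaf" favors many small clusters over one big one
)
labels = clusterer.fit_predict(Z)    # label -1 means noise
print("noise fraction:", (labels == -1).mean())
```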

My Questions:

  1. How can fiddling with parameters help HDBSCAN cluster this space?
  2. Is there a better algorithm for clustering such a space?
  3. Or is the data simply non-clusterable, from what you can see in the plots?

Thanks so much in advance; I've been struggling with this for hours now.

Mhaexym
  • Interesting project. A few things I would try: 1. Train your autoencoder with a latent dimension of 2 and plot the embeddings. 2. Maybe use a VAE instead of the vanilla AE. 3. t-SNE is widely used, but it has its drawbacks; I'd try UMAP and PCA as well on the 64-dim embeddings. Finally, you have labels and tags for each movie, so try coloring your embedding plots according to the tags and genres to see whether movies with the same tag/genre are close together in your embedding space (a sketch of this check appears below the thread). If not, you might want to work on more meaningful embeddings instead of the clustering algorithm. – Tinu May 11 '22 at 11:55
  • BTW, why are you using an autoencoder at all? Can't you just cluster your TF-IDF vectors directly? How well does that work? Again, I'd try t-SNE, UMAP, and PCA directly on the TF-IDF vectors and color the points according to genre/tag to see if they are close. Try HDBSCAN (or another clustering algo) directly on the TF-IDF vectors. – Tinu May 11 '22 at 12:00
  • I think you might be right on the more meaningful embeddings part. They are going to be the input for a neural net that will generate recommendations out of them, based on iterative clustering of items and user interests (which is why I need HDBSCAN), plus some other algorithms that will determine whether a recommendation can be seen as different from what a user has already received. I'm using the autoencoder because training a neural net on 10,000-dim TF-IDF vectors is much heavier than on 64-dim latent vectors. What would be the benefit of using a VAE vs. a vanilla AE? – Mhaexym May 11 '22 at 12:40
  • Don't know which TF-IDF library you use, but sklearn's TfidfVectorizer has a `max_features` parameter where you can specify the maximum number of features to be returned, so you could set that to 64 (or whatever) and skip the AE part of your model. A VAE has the advantage that the latent embedding will follow a distribution (usually an isotropic Gaussian), which makes it appealing for downstream tasks. See here for example: https://stats.stackexchange.com/q/324340/264183 – Tinu May 11 '22 at 13:01
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. –  May 11 '22 at 20:25
  • I believe the earlier commenter has already given more than adequate answers, which have been very helpful, and if this question had one specific answer, I would have found it by now. So it remains somewhat on the general side, but the main question would be: how would you use any algorithm (but preferably HDBSCAN) to cluster this space? – Mhaexym May 12 '22 at 06:42
  • @Tinu I am now implementing a VAE; you convinced me it will have many benefits later in the project too, so even if the clustering does not work here, I'll at least have the VAE. Setting TF-IDF's `max_features` definitely does not work: it only restricts the vocabulary to the most frequent terms instead of compressing the information from the dropped features into the kept ones, which makes the vectors even worse. Before, I only vectorized unigrams, but I'm now also trying up to trigrams, because I have the genre sequences too. – Mhaexym May 12 '22 at 07:13
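
A minimal sketch of the coloring check Tinu suggests above, assuming umap-learn for the 2D projection and one primary genre label per movie (the function name and its inputs are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
import umap


def plot_by_genre(Z, genres):
    """Project embeddings to 2D with UMAP and color each movie by genre.

    Z: (n_movies, 64) embedding array; genres: one genre label per movie.
    Same-genre points forming visible groups suggests meaningful embeddings.
    """
    Z2 = umap.UMAP(n_components=2, random_state=42).fit_transform(Z)
    genres = np.asarray(genres)
    for genre in np.unique(genres):
        mask = genres == genre
        plt.scatter(Z2[mask, 0], Z2[mask, 1], s=3, label=genre)
    plt.legend(markerscale=3, fontsize=6)
    plt.show()
```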

0 Answers