How can I get the similarity matrix from minhash LSH?

Question

I have read many tutorials and tried a number of MinHash LSH implementations, but none of them can generate a similarity matrix; they only return the items whose similarity exceeds the threshold. How can I generate the full matrix? My intention is to use the LSH results for clustering.

score 1 · Accepted Answer · answered Jan 05 '18 at 09:38


The whole point of LSH is to avoid computing all pairwise distances, because that does not scale.

If you then put the data into a distance matrix, you get all the scalability problems again!
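
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes one 8-byte float per pair and stores only the upper triangle of the matrix; the helper name `pairwise_matrix_cost` is made up for this illustration.

```python
def pairwise_matrix_cost(n, bytes_per_entry=8):
    """Return (number of unordered pairs, bytes needed to store them) for n items."""
    pairs = n * (n - 1) // 2          # upper triangle, diagonal excluded
    return pairs, pairs * bytes_per_entry


for n in (10_000, 1_000_000):
    pairs, size = pairwise_matrix_cost(n)
    print(f"n={n:>9,}: {pairs:,} pairs, about {size / 1e9:,.1f} GB")
```

Going from ten thousand to a million items multiplies the number of pairs by roughly ten thousand, which is the quadratic growth LSH is meant to sidestep.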

Instead, consider an algorithm like DBSCAN clustering. It doesn't need a distance matrix, only the neighbors within distance epsilon of each point.
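
As a minimal sketch of that idea (not a reference implementation), the code below runs DBSCAN using nothing but a range query answered by a toy MinHash LSH index. Everything here is an assumption made for illustration: the class `ToyMinHashIndex`, the function `dbscan`, the parameters `num_perm`, `bands`, `eps` and `min_pts`, and the sample documents are all hypothetical and do not belong to any library. Token sets are assumed to be non-empty, and candidates are verified with the exact Jaccard distance to drop false positives.

```python
import random
import zlib
from collections import defaultdict

NOISE = -1


class ToyMinHashIndex:
    """Toy MinHash LSH index using the banding scheme (illustrative only)."""

    def __init__(self, num_perm=64, bands=16, seed=0):
        rng = random.Random(seed)
        self.prime = (1 << 61) - 1
        # One (a, b) pair per hash function h(x) = (a*x + b) mod prime.
        self.coeffs = [(rng.randrange(1, self.prime), rng.randrange(self.prime))
                       for _ in range(num_perm)]
        self.bands, self.rows = bands, num_perm // bands
        self.buckets = defaultdict(set)  # (band index, band signature) -> keys
        self.sets = {}                   # key -> token set
        self.band_keys = {}              # key -> list of bucket keys

    def insert(self, key, tokens):
        tokens = set(tokens)
        xs = [zlib.crc32(t.encode()) for t in tokens]
        sig = [min((a * x + b) % self.prime for x in xs) for a, b in self.coeffs]
        keys = [(i, tuple(sig[i * self.rows:(i + 1) * self.rows]))
                for i in range(self.bands)]
        self.sets[key], self.band_keys[key] = tokens, keys
        for band_key in keys:
            self.buckets[band_key].add(key)

    def neighbors(self, key, eps):
        """Indexed keys within Jaccard distance eps of `key` (including `key`)."""
        tokens = self.sets[key]
        candidates = set()
        for band_key in self.band_keys[key]:
            candidates |= self.buckets[band_key]
        # Exact check on the candidates only, never on all pairs.
        return [c for c in candidates
                if 1 - len(tokens & self.sets[c]) / len(tokens | self.sets[c]) <= eps]


def dbscan(keys, neighbors, min_pts):
    """DBSCAN driven only by a range query `neighbors(key)`; no distance matrix."""
    labels = {}
    cluster = -1
    for key in keys:
        if key in labels:
            continue
        seeds = neighbors(key)
        if len(seeds) < min_pts:
            labels[key] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[key] = cluster
        stack = [s for s in seeds if s != key]
        while stack:
            q = stack.pop()
            if labels.get(q) == NOISE:
                labels[q] = cluster      # border point reached from a core point
            if q in labels:
                continue
            labels[q] = cluster
            q_neighbors = neighbors(q)
            if len(q_neighbors) >= min_pts:
                stack.extend(q_neighbors)  # q is a core point: keep expanding
    return labels


docs = {
    "a1": "minhash lsh finds similar documents fast",
    "a2": "minhash lsh finds similar documents quickly",
    "a3": "minhash lsh finds similar documents very fast",
    "b1": "dbscan clusters points using range queries",
    "b2": "dbscan clusters points using range queries only",
    "z": "completely unrelated text",
}
index = ToyMinHashIndex()
for key, text in docs.items():
    index.insert(key, text.split())
labels = dbscan(list(docs), lambda k: index.neighbors(k, eps=0.4), min_pts=2)
print(labels)
```

With 16 bands of 4 rows, the banding scheme becomes likely to report a pair once its Jaccard similarity passes about 0.5, so `eps` should stay comfortably below the matching distance. Because LSH is probabilistic, a few true neighbors can still be missed, and the labels may differ slightly from an exact DBSCAN run.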


Has QUIT--Anony-Mousse


Thanks for your suggestion. I tried clustering with union-find (union by rank), but I'm not satisfied with the result because the clusters are not well formed. DBSCAN sounds like a good idea, since it can separate noise from the clusters. What about HDBSCAN? It seems like a better algorithm than DBSCAN. Is there any code or tutorial that shows how to use it with LSH? – z3r0 Jan 06 '18 at 16:24
HDBSCAN won't work with LSH, because it doesn't use a range threshold. – Has QUIT--Anony-Mousse Jan 06 '18 at 23:39
I see, but I am just wondering, can DBSCAN handle clusters of very different densities? – z3r0 Jan 08 '18 at 07:54
You can run it with different epsilons. If you want to use LSH, you *cannot* expect arbitrary densities to work; see the definition of LSH for the reason. You need to stay below the threshold. OPTICS is an option, but performance degrades as you increase the threshold. And since the whole reason for using LSH is performance, you want the threshold to be as small as possible. – Has QUIT--Anony-Mousse Jan 08 '18 at 08:08
