
I have a large PostgreSQL database containing documents. Every document is represented as a row in a table. When a new document is added to the database I need to check for duplicates. But I can't just use a SELECT to find an exact match. Two documents can vary slightly and still be considered duplicates, for example if some minor fields differ and all other fields are equal.

I researched this problem and found a method to solve it: calculate a MinHash signature for every document and construct an inverted index to query similar documents from the database. But I can't understand how to map MinHash onto a relational database.

As I understand it, a MinHash signature is a list of N hashes, where N is the number of attributes. Similarity is calculated as follows:

# Given two signatures Sa and Sb, each containing N hashes,
# similarity is the fraction of positions where the hashes are equal.
def minhash_similarity(Sa, Sb):
    N = len(Sa)
    number_of_equal_hashes = 0
    for ix in range(N):
        if Sa[ix] == Sb[ix]:
            number_of_equal_hashes += 1
    return float(number_of_equal_hashes) / N
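For example, with two made-up 4-hash signatures:

similarity = minhash_similarity([12, 7, 33, 5], [12, 9, 33, 5])  # -> 0.75, since 3 of 4 hashes match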

This is simple if you already have two signatures; the problem is to find all documents (with corresponding signatures) in the database with similarity greater than or equal to some value.

For example, I can create a table with multiple columns like this:

| minhash0 | minhash1 | minhash2 | docid |

Each minhashX column corresponds to the MinHash of one of the document's attributes, and docid is the document's identifier. I can query similar records this way:

select * from invidx
where ((case when minhash0 = minhash2search0 then 1 else 0 end) +
       (case when minhash1 = minhash2search1 then 1 else 0 end) +
       (case when minhash2 = minhash2search2 then 1 else 0 end))::float / N > THRESHOLD

where minhash2searchX are the MinHashes of the new document and THRESHOLD is the minimal similarity. But this approach requires a full scan. Is there any way to speed up this algorithm?

Evgeny Lazin
  • I have multiple minhash columns. – Evgeny Lazin Dec 05 '12 at 08:48
  • Then what is your question? – Has QUIT--Anony-Mousse Dec 05 '12 at 08:52
  • I don't know how to query them efficiently. – Evgeny Lazin Dec 05 '12 at 10:24
  • Are you able to parse the documents' contents and add a new row to the table? Possibly you'll need to store some kind of index of the file (not a hash) and compare it with a side tool. – Evgeny Gavrin Dec 05 '12 at 10:56
  • Are you able to use an external full-text search engine, e.g. Lucene or Solr? – stemm Dec 05 '12 at 10:57
  • Yes, but I can't imagine how this could help. My documents are just lists of attributes. The attributes are mostly integers; some of them are strings, but not many. – Evgeny Lazin Dec 05 '12 at 11:17
  • Did you try: `SELECT docid WHERE minhash1 = ? OR minhash2 = ? OR minhash3 = ?` to get the candidates? This *should* allow your RDBMS to use B-trees for fast candidate retrieval. – Has QUIT--Anony-Mousse Dec 05 '12 at 15:00
  • @Anony-Mousse, if I understand the problem correctly - a straightforward `SELECT ... WHERE ...` would retrieve too many irrelevant candidates. For example, if `THRESHOLD` is `0.9`, it means that `90%` of the fields have to match. But with `SELECT ... WHERE ...`, every document that matches at least one field would be retrieved – stemm Dec 05 '12 at 15:56
  • Yes, but how many DocIDs is that? How selective are your `minhash` values, and what is the maximum cardinality of an exact match? It **can** be quite fast to select too much (as long as it is selective *enough*), and then filter in a second step, e.g. via a nested query. – Has QUIT--Anony-Mousse Dec 05 '12 at 16:24
  • In fact this likely is how all the document search engines do it. Inverted lists work *exactly* this way: you get all the matches for each hash value and *count-join* them. – Has QUIT--Anony-Mousse Dec 05 '12 at 16:25
  • @Anony-Mousse, as far as I know, search engines usually use various techniques to improve performance and memory usage, e.g. skip lists, in-memory indexes, etc. So I assume that for really large amounts of data, using a search engine brings some benefit. But yes, I agree that in some cases it is possible to do without a search engine – stemm Dec 05 '12 at 17:44
  • Google-scale search engines do a lot of stuff we don't imagine. One thing is for sure: they don't use an RDBMS as a backend. But I doubt you are at the scale to worry about that yet. – Has QUIT--Anony-Mousse Dec 05 '12 at 18:48
  • @Anony-Mousse, I agree with you. In my previous comment I was just describing the internals of the Lucene engine. – stemm Dec 05 '12 at 20:25
  • You've got an interesting discussion here. I think `WHERE minhashX=? OR...` would work just fine because hash collisions are relatively rare (how about posting it as an answer?). Also, I'm thinking about a simpler one-to-many mapping, from minhash to docid. There is no difference between minhash0 and minhashN, so they can be represented with a single column. We just get the list of hashes generated from the new document, then for each hash we get the list of documents, and then we rank the documents by similarity. – Evgeny Lazin Dec 06 '12 at 06:49
  • And I found this book - http://infolab.stanford.edu/~ullman/mmds.html; chapter 3 is called "Finding Similar Items" and covers MinHash and LSH. – Evgeny Lazin Dec 06 '12 at 06:50

2 Answers


To take advantage of an inverted index, I'd suggest using a full-text search engine for your purposes, e.g. Lucene or Solr (which is based on Lucene).

You can construct a "document" (in Lucene terms) containing fields associated with the MinHashes of your documents (db records). Note that you're able to index numeric fields as well (you just need to describe the field types in the schema). Also, you have to store the primary key of each document, to map Lucene "documents" back onto records from your db.

Index the entire collection of your documents.

To find documents similar to a given document, you have to calculate the MinHash for each field and query Lucene for similar documents:

field1:MinHash1 OR field2:MinHash2 OR ...

The more fields a document matches, the higher its rank will be. So you can take a few documents with the highest rank and decide whether they are really similar in your case.
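For example, a rough sketch of issuing such a query against Solr over HTTP might look like this (the Solr URL, core name "docs", and field names minhash0..minhashN are placeholders for whatever your schema actually uses):

# Sketch: query Solr for documents sharing MinHash values with the new document.
# The Solr URL, core name ("docs") and field names are placeholders.
import json
import urllib.parse
import urllib.request

def find_similar(minhashes, top_k=10):
    # Build a query like "minhash0:123 OR minhash1:456 OR ..."
    q = " OR ".join("minhash%d:%d" % (i, h) for i, h in enumerate(minhashes))
    params = urllib.parse.urlencode(
        {"q": q, "fl": "docid,score", "rows": top_k, "wt": "json"})
    with urllib.request.urlopen("http://localhost:8983/solr/docs/select?" + params) as resp:
        result = json.load(resp)
    # Documents matching more fields score higher and come back first.
    return [(d["docid"], d["score"]) for d in result["response"]["docs"]]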

Also, field boosting may be useful for you.

stemm

Your hash table should contain two columns:

| minhash | docid |

It should be indexed on minhash.

When a new document arrives, you search on each of its minhashes in turn, querying the table to find prior documents sharing that minhash. You build up a tally of how many minhashes are shared by these prior documents, and then discard all those with fewer than (e.g.) 50% of the minhashes shared. This efficiently yields the set of all documents that are at least (estimated) 50% similar.

Finally, you insert a new row for each of the new document's minhashes.
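As a rough sketch (assuming a table like minhash_index(minhash, docid) with a B-tree index on minhash, and the psycopg2 driver; all names here are placeholders), the lookup and insert could look like this, with the tally done in SQL via GROUP BY/HAVING:

# Sketch of the minhash -> docid approach. Table and column names and the
# psycopg2 driver are assumptions; adapt them to your schema.
import math
import psycopg2

def find_similar_and_index(conn, new_docid, minhashes, threshold=0.5):
    n = len(minhashes)
    min_shared = int(math.ceil(threshold * n))
    with conn.cursor() as cur:
        # Tally how many minhashes each prior document shares with the new one,
        # keeping only documents at or above the threshold.
        cur.execute(
            "SELECT docid, count(*) AS shared"
            " FROM minhash_index"
            " WHERE minhash IN %s"
            " GROUP BY docid"
            " HAVING count(*) >= %s",
            (tuple(minhashes), min_shared))
        similar = [(docid, shared / float(n)) for docid, shared in cur.fetchall()]
        # Insert one row per minhash of the new document.
        cur.executemany(
            "INSERT INTO minhash_index (minhash, docid) VALUES (%s, %s)",
            [(h, new_docid) for h in minhashes])
    conn.commit()
    return similar

Because minhash is indexed, the query only touches rows that share at least one minhash with the new document, rather than scanning the whole table.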

Using Lucene or Solr is a bad solution. It will require a lot more storage, will be more complex to implement, and will be vastly less efficient. Yes, you could get Lucene to index your minhashes and run a query as stemm suggests. This will return every document that shares even a single minhash, which could be tens or hundreds of thousands, depending on your data size. You then have to individually compare each one of these to your incoming document using the "Similarity" feature, which would be super slow.

Lucene does offer a "MoreLikeThis" feature to find documents sharing certain keywords, but this would miss many similar documents that a minhash approach would find.

Ben Whitmore