
I am doing a similarity search between a 256-character string and a corpus of 9000 entries, each about 1000 words long.

I used locality-sensitive hashing, following https://github.com/Jmkernes/Locality-sensitive-hashing-tutorial/blob/main/LocalitySensitiveHashing.ipynb . It generates candidate pairs, which I then filtered.
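For context, here is a minimal sketch of this kind of MinHash/LSH pipeline, written with the `datasketch` library rather than the notebook's from-scratch code; the shingle size, `threshold`, and `num_perm` values are placeholder choices, and `corpus`/`query` stand in for my real data:

```python
from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128):
    """Build a MinHash signature from 3-word shingles of the text."""
    m = MinHash(num_perm=num_perm)
    tokens = text.lower().split()
    for i in range(max(1, len(tokens) - 2)):
        m.update(" ".join(tokens[i:i + 3]).encode("utf-8"))
    return m

# corpus: the ~9000 entries of ~1000 words each (placeholder content here)
corpus = ["first document ...", "second document ..."]

# Index every corpus entry once.
lsh = MinHashLSH(threshold=0.3, num_perm=128)
for idx, doc in enumerate(corpus):
    lsh.insert(str(idx), minhash_signature(doc))

# Query with the short string and rank the candidates it returns.
query = "the 256-character query string ..."
q_sig = minhash_signature(query)
candidates = lsh.query(q_sig)  # keys of entries likely to be similar
best = max(candidates,
           key=lambda k: q_sig.jaccard(minhash_signature(corpus[int(k)])),
           default=None)
print(best)  # index of the most similar entry, or None if no candidate
```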

One problem is that each corpus entry is about 1000 words long, which makes the search inefficient because the whole corpus has to stay in memory. In general, it is very slow.

The goal is to quickly output the index of the corpus entry whose content is most similar to the 256-character string.

My thinking is that the entries need to be simplified and serialized to a file for quick loading.
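Along those lines, a sketch of what I have in mind, continuing from the code above (the file name is a placeholder): precompute the MinHash signatures once and pickle them, so only the small signatures, not the full texts, have to be loaded at query time.

```python
import pickle

# One-off preprocessing step: keep only the signatures, drop the full texts.
signatures = {idx: minhash_signature(doc) for idx, doc in enumerate(corpus)}
with open("corpus_signatures.pkl", "wb") as f:
    pickle.dump(signatures, f)

# At query time: load the signatures (much smaller than the corpus) and scan.
with open("corpus_signatures.pkl", "rb") as f:
    signatures = pickle.load(f)

q_sig = minhash_signature(query)
best_idx = max(signatures, key=lambda idx: q_sig.jaccard(signatures[idx]))
print(best_idx)
```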

Which paper or implementation do you recommend?

Per Bock
  • One answer is to do it in batches instead of all at once (see the sketch after these comments). Another option, if you're not already, is to use a swap file so you can do more at once. This will be slower, but may help. You could also potentially try using Google Colab, as it typically has _way_ higher specs than a local machine, but if you're using a server or some other high-powered device, this may not hold true. – cocomac Mar 01 '22 at 03:37
  • Have a look at Witten/Moffat/Bell: _Managing Gigabytes_, which covers processing of large amounts of texts. – Oliver Mason Mar 01 '22 at 08:40
  • @cocomac Colab is out of the question as the data is sensitive. – Per Bock Mar 09 '22 at 18:56
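A minimal sketch of the batching idea from the first comment, reusing `minhash_signature` from the question (the batch size is arbitrary): only `batch_size` documents are held in memory at a time while a running best match is kept.

```python
def best_match_in_batches(doc_iterator, query_sig, batch_size=500):
    """Scan corpus entries in fixed-size batches, tracking the best match,
    so the whole corpus never has to be resident in memory at once."""
    best_idx, best_score = None, -1.0
    batch = []

    def scan(batch, best_idx, best_score):
        for idx, doc in batch:
            score = query_sig.jaccard(minhash_signature(doc))
            if score > best_score:
                best_idx, best_score = idx, score
        return best_idx, best_score

    for idx, doc in enumerate(doc_iterator):  # doc_iterator can stream from disk
        batch.append((idx, doc))
        if len(batch) == batch_size:
            best_idx, best_score = scan(batch, best_idx, best_score)
            batch = []
    if batch:
        best_idx, best_score = scan(batch, best_idx, best_score)
    return best_idx
```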

0 Answers