
I have a database of 350,000 strings with an average length of about 500. The strings are not made up of words; they are essentially random assortments of characters.

I need to make sure no two of the strings are too similar, where similarity is defined as the edit distance divided by the average length of the two strings. The division is because smaller edit distances are more acceptable for smaller strings. It is fine if a different metric is used for performance reasons, but edit distance is the preferred baseline metric.

Naively, we calculate edit distance with runtime O(a*b), where a and b are the lengths of the two strings. We do this for all n^2 pairs, which gives an overall runtime of O(n^2*a*b), clearly too large with n = 350,000 and a, b = 500.
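For concreteness, the naive approach I have in mind looks roughly like this (edit_distance is just the textbook dynamic-programming version, too_similar and the 0.2 threshold are arbitrary placeholders of mine; this is only meant to illustrate the cost):

import itertools

def edit_distance(a, b):
    # Textbook DP Levenshtein distance: O(len(a) * len(b)) time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def too_similar(a, b, threshold=0.2):
    # Normalize by the average length of the two strings.
    return edit_distance(a, b) / ((len(a) + len(b)) / 2) < threshold

# The O(n^2) pair loop that is hopeless at n = 350,000:
# for a, b in itertools.combinations(strings, 2):
#     if too_similar(a, b):
#         ...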

The database is in the form of a Python list read from a csv file. I'd like to process it in a Pythonic way, if possible.

How can this be sped up? I'm not sure exactly how long the naive algorithm would take to finish (probably on the order of weeks), but it ideally should take less than a day to run.

Evan Weissburg
  • Constructing an [FST](https://blog.burntsushi.net/transducers/#levenshtein-automata) will allow you to do the search much faster. – user2722968 Feb 16 '18 at 07:20
  • Can you tell us more about the database? Is it a DBMS? NOSQL? What are we dealing with here? Are you trying to write a query to do this, or are you loading all the strings from the DB and doing your calculations in the python code itself? How much does it need to be sped up? – mypetlion Feb 20 '18 at 00:35
  • @mypetlion Updated. – Evan Weissburg Feb 20 '18 at 00:36
  • Does `ratio` or `partial_ratio` from https://pypi.python.org/pypi/fuzzywuzzy work for you? Or do you need edit distance only? – Tarun Lalwani Feb 20 '18 at 08:42
  • That should work @TarunLalwani. The issue is that it will probably still take too long `O(n^2*m)` time, assuming ratio runs in linear `m` time. – Evan Weissburg Feb 20 '18 at 17:39
  • First of all, if you have a 500-character string, then you don't want to do a char-by-char comparison. You want to do it word by word. This will reduce the length of the sequences to 1/4 or 1/5 and hence give fewer permutations. I used this method to first cache the tokenization of all the strings and then do the comparison using a multiprocessing pool. I have not looked at the under-the-hood complexity of this method – Tarun Lalwani Feb 20 '18 at 17:44
  • You should definitely write up an answer -- that sounds like exactly what I'm looking for. Nearly any string comparison is good enough for my purposes as long as it is consistent. – Evan Weissburg Feb 20 '18 at 17:46
  • It seems like you might want to use some sort of https://en.wikipedia.org/wiki/Locality-sensitive_hashing . There are well-developed ones for hamming distance. – Haochen Wu Feb 20 '18 at 18:12
  • One thing you can do with locality-sensitive hashing is basically hash all of your strings (O(n)) and check if all the hashes are unique (O(n)). I would say bit sampling is probably a good choice here since you can tweak the number of bits you sample. The only issue is that it is based on hamming distance, which is not exactly what you want. – Haochen Wu Feb 20 '18 at 18:19
  • Can I check if the hashes are similar? – Evan Weissburg Feb 20 '18 at 18:24
  • You don't. The idea is that if two strings are similar enough, they should have the same hash. By tweaking the number of bits you sample, for example, you can specify what level of similarity will give you the same hash. – Haochen Wu Feb 20 '18 at 18:28
  • Definitely write-up an answer with a Pythonic code snippet to collect the bounty -- that sounds great. – Evan Weissburg Feb 20 '18 at 18:29
  • I tried and the speed is 3,500 comparisons/sec with 8 cores, which won't work. The basic idea of improvement would be to not do brute-force one-to-one comparisons and to reduce the comparison count – Tarun Lalwani Feb 20 '18 at 18:42
  • Please also look at https://github.com/scivey/relevanced/blob/master/docs/index.md – Tarun Lalwani Feb 21 '18 at 03:17
  • @EvanWeissburg - could you include a small representative sample of your strings? Or, if not - how many different characters do they comprise, and what does the distribution of character frequencies look like? Finally, could you give an example of two strings that are "too similar" (but not trivially so)? – Nathan Vērzemnieks Feb 22 '18 at 04:11

1 Answer


I wrote a very brief prototype of a simple locality-sensitive hashing algorithm in Python. However, there are a few caveats, and you may want to optimize some pieces as well; I'll mention them as they come up.

Assume all your strings are stored in `strings`.

import random
from collections import Counter

MAX_LENGTH = 500
SAMPLING_LENGTH = 10

def bit_sampling(string, indices):
    # Keep only the characters at the sampled positions; pad with a space
    # when the string is shorter than a sampled index.
    return ''.join([string[i] if i < len(string) else ' ' for i in indices])

# Choose SAMPLING_LENGTH random positions once and hash every string by the
# characters it has at those positions.
indices = random.sample(range(MAX_LENGTH), SAMPLING_LENGTH)
hashes = [bit_sampling(string, indices) for string in strings]

# Any hash occurring more than once marks a group of candidate duplicates.
counter = Counter(hashes)
most_common, count = counter.most_common()[0]
while count > 1:
    dup_indices = [i for i, x in enumerate(hashes) if x == most_common]
    # You can use dup_indices to check the edit distance for original groups here.
    counter.pop(most_common)
    if not counter:
        break
    most_common, count = counter.most_common()[0]

First of all, this is a slight variant of bit sampling that works best for the general hamming distance. Ideally, if all your strings are of the same length, it gives a theoretical probability bound on the hamming distance. When the hamming distance between two strings is small, it is very unlikely that they will get different hashes. This trade-off is controlled by the parameter SAMPLING_LENGTH: a larger SAMPLING_LENGTH makes it more likely to hash similar strings to different hashes, but also reduces the probability of hashing not-very-similar strings to the same hash. For hamming distance, you can calculate this trade-off easily.
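As a rough back-of-the-envelope sketch of that trade-off (my own addition, assuming equal-length strings and using math.comb from Python 3.8+), the probability that two strings at hamming distance d still collide into the same hash is the chance that none of the sampled positions hits a differing character:

from math import comb

def same_hash_probability(length, hamming_distance, sampling_length):
    # Chance that every sampled position avoids the differing characters,
    # i.e. the two strings end up with identical hashes.
    return comb(length - hamming_distance, sampling_length) / comb(length, sampling_length)

for k in (5, 10, 20, 50):
    print(k, same_hash_probability(500, 25, k))

For example, with length 500 and hamming distance 25 (5% of the characters differ), sampling 10 positions still gives roughly a 60% chance of a collision per run, so such a pair is very unlikely to be missed across a handful of runs.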

Running this snippet multiple times can increase your confidence that there are no similar strings, since each run samples different positions.
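A minimal sketch of that repetition (assuming the strings, MAX_LENGTH, SAMPLING_LENGTH and bit_sampling from above; NUM_ROUNDS, buckets and candidate_groups are names I made up here):

NUM_ROUNDS = 5

candidate_groups = []
for _ in range(NUM_ROUNDS):
    indices = random.sample(range(MAX_LENGTH), SAMPLING_LENGTH)
    hashes = [bit_sampling(string, indices) for string in strings]
    buckets = {}
    for i, h in enumerate(hashes):
        buckets.setdefault(h, []).append(i)
    # Keep only buckets with more than one member: these are the candidate
    # near-duplicates to verify with a real edit distance check.
    candidate_groups.extend(g for g in buckets.values() if len(g) > 1)

Each round misses a given similar pair independently, so the chance of missing the same pair in every round shrinks quickly.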

To accommodate your need to compare strings of different lengths, one possible approach is to left-pad the shorter strings with spaces and make copies of them.
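One possible reading of that suggestion (my interpretation only, with max_shift as an assumed parameter) is to pad every string to a fixed width and also index a few space-shifted copies, so that a single insertion or deletion near the start does not move every sampled position:

def padded_copies(string, max_length=MAX_LENGTH, max_shift=2):
    # The original string plus copies shifted right by 1..max_shift spaces,
    # each padded or cut to exactly max_length characters.
    return [(' ' * shift + string).ljust(max_length)[:max_length]
            for shift in range(max_shift + 1)]

If any copy of one string collides with any copy of another, treat the pair as a candidate and fall back to a real edit distance check.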

Though all of the operations in this snippet are linear (O(n)), they may still consume significant memory and running time, and it might be possible to shave off a constant factor.

You might also want to consider using a more sophisticated locality-sensitive hashing algorithm, such as those surveyed here: https://arxiv.org/pdf/1408.2927.pdf

Haochen Wu
  • Question before I implement and play around with this (and then accept your answer): have you runtime tested this with some strings (350,000 @ 500 each)? – Evan Weissburg Feb 20 '18 at 20:33
  • I just ran it on a synthetic dataset and it takes less than 1 min without problems. You might still want to wait a bit before accepting, just in case others have a better answer, but you can start to play with it because it's really fast. – Haochen Wu Feb 20 '18 at 20:45
  • True. Your explanation was very clean and concise and I appreciate the arxiv source as well. Thanks. – Evan Weissburg Feb 20 '18 at 20:46
  • I have a possible issue with the approach provided. Since the maximum length is 500, a sampling length of 500 should not show any duplicates. However, I find that counter has only 85,701 entries from a dataset of 350,820 using the above parameters. Is this expected behavior? – Evan Weissburg Feb 20 '18 at 21:42
  • This is probably specific to your data. I got 350,000 entries in the counter with random input. Did you check which inputs had the same hash? `dup_indices` should point to those entries. – Haochen Wu Feb 20 '18 at 21:48
  • Is my assumption correct that the only duplicates found with `max_len=500` and `sampling_len=500` are exact character for character duplicates? – Evan Weissburg Feb 20 '18 at 21:50
  • Yes, it should be. Are the inputs actually the same? – Haochen Wu Feb 20 '18 at 21:51
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165523/discussion-between-evan-weissburg-and-haochen-wu). – Evan Weissburg Feb 20 '18 at 21:53
  • @EvanWeissburg I am curious whether the above code works for you. If I understand correctly, you are looking for edit-distance-based similarity, right? As Wu has mentioned, it might work well for hamming distance. Two strings that have a small edit distance can have a large hamming distance. – viz12 Feb 21 '18 at 17:11
  • @viz12 It's not a perfect metric for the task but it is lightning fast. False positive duplicates are less of an issue than not completely identifying duplicates, so it works fine for me since I can tune the sampling length. – Evan Weissburg Feb 21 '18 at 17:19
  • @EvanWeissburg I am glad it worked well for you. I just saw your chat discussion; since you are working on biological data, hamming distance will work unless there are lots of insertions and deletions. – viz12 Feb 21 '18 at 17:30
  • It seems as though this approach won't detect strings that are identical except for even a single insertion or deletion, let alone ones that are less similar. Won't that be a problem? You're not going to get a lot of false positives but rather a lot of false negatives, at least if two strings of different lengths can be "too similar". – Nathan Vērzemnieks Feb 22 '18 at 04:00
  • @EvanWeissburg - did you try introducing a string to your dataset that would be "too similar" to an existing one and seeing if this approach detects it? – Nathan Vērzemnieks Feb 22 '18 at 04:10
  • @NathanVērzemnieks I will verify this tomorrow. Most of the protein strings I am working with that are duplicates have substitutions (perhaps tested in a study for the effect of substitution) as opposed to additions/deletions. I'm still looking for a better solution, if possible. – Evan Weissburg Feb 22 '18 at 04:14