
I am currently processing a very large database of locations and trying to match them with their real world coordinates.

To achieve this, I have downloaded the geoname dataset, which contains a lot of entries. It gives possible names and lat/long coordinates. To try to speed up the process, I have managed to reduce the huge CSV file (1.6 GB) to 0.45 GB by removing entries that do not make sense for my dataset. However, it still contains 4 million entries.

Now I have many entries such as:

  1. Slettmarkmountains seen from my camp site in Jotunheimen, Norway, last week
  2. Adventuring in Fairy Glen, Isle of Skye, Scotland, UK
  3. Morning in Emigrant Wilderness, California

Knowing that string matching is unreliable with such long strings, I used Stanford's NER via NLTK to get a better string to qualify my location (a sketch of that extraction step follows the list below). Now I have strings like:

  1. Slettmarkmountains Jotunheimen Norway
  2. Fairy Glen Skye Scotland UK
  3. Emigrant Wilderness California
  4. Yosemite National Park
  5. Half Dome Yosemite National Park
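
Roughly, the extraction step looks like this (just a sketch; the model and jar paths are placeholders for wherever the Stanford NER files live):

from nltk.tokenize import word_tokenize
from nltk.tag import StanfordNERTagger

# Placeholder paths to the local Stanford NER model and jar
st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz', 'stanford-ner.jar')

def extract_location(caption):
    # Keep only the tokens that the 3-class model tags as LOCATION
    tagged = st.tag(word_tokenize(caption))
    return ' '.join(tok for tok, label in tagged if label == 'LOCATION')

extract_location('Morning in Emigrant Wilderness, California')
# roughly 'Emigrant Wilderness California', depending on the model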

The geoname dataset contains things like:

  1. Jotunheimen Norway Lat Long
  2. Slettmarkmountains Jotunheimen Norway Lat Long
  3. Bryce Canyon Lat Long
  4. Half Dome Lat Long
  5. ...

I am applying the following algorithm to get a good possible match between my entries and the geoname CSV containing 4M entries. I first read the geoname_cleaned.csv file and put all of the data into a list. Then, for each of my entries, I call string_similarity() between the current entry and every entry in the geoname_list (the driver loop is sketched after the two functions below).

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union
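
The driver loop around these two functions is roughly this (a sketch; I'm assuming geoname_cleaned.csv has the name in the first column followed by lat/long, and that my_entries holds my NER-reduced strings):

import csv

# Read the cleaned geoname data into memory once
with open('geoname_cleaned.csv', newline='') as f:
    geoname_list = list(csv.reader(f))

for entry in my_entries:
    # Score the current entry against every geoname row and keep the best one
    best_row = max(geoname_list, key=lambda row: string_similarity(entry, row[0]))
    print(entry, '->', best_row)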

I have tested the algorithm on a subset of my original dataset and it works fine, but it is obviously terribly slow (it takes up to 40 seconds for a single location). Since I have more than a million entries to process, this would take a good 10,000 hours or more. I was wondering if you had any ideas on how to speed this up. I thought of parallel processing, obviously, but I don't have any HPC solution available; perhaps simpler ideas could help me speed this up.
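
For what it's worth, the kind of parallel processing I had in mind is just the standard multiprocessing module on a single machine, along the lines of this sketch (match_entry is a stand-in that wraps the per-entry matching above):

from multiprocessing import Pool

def match_entry(entry):
    # Wrap the per-entry matching so it can be farmed out to worker processes
    best_row = max(geoname_list, key=lambda row: string_similarity(entry, row[0]))
    return entry, best_row

if __name__ == '__main__':
    with Pool() as pool:
        results = pool.map(match_entry, my_entries)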

I'm open to any and every idea you might have, but would somewhat prefer a Python-compatible solution.

Thanks in advance :).

Edit:

I have tried fuzzywuzzy with fuzz.token_set_ratio(s1, s2) and it performs worse: the matches are not as good as with my custom technique, and the running time increases by a good 15 seconds for a single entry.
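
For reference, the fuzzywuzzy attempt was essentially a drop-in replacement for my similarity function, along these lines:

from fuzzywuzzy import fuzz

def string_similarity(str1, str2):
    # token_set_ratio returns an integer 0-100, so rescale it to 0-1
    return fuzz.token_set_ratio(str1, str2) / 100.0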

Edit 2:

I also thought of using some kind of sorting at the beginning to help with the matching, but my naive implementation did not work. I'm sure there are ways to speed this up though, perhaps by getting rid of some entries in the geoname dataset or by sorting them in some way. I have already done a lot of cleaning to remove useless entries, but I can't get the number much below 4M.

LBes

2 Answers


We can speed up the matching in a couple of ways. I assume that in your code str1 is a name from your dataset and str2 is a geoname string. To test the code I made two tiny data sets from the data in your question. I also wrote two matching functions, best_match and first_match, that use your current string_similarity function so we can see that my strategy gives the same results. best_match checks all geoname strings & returns the string with the highest score if it exceeds a given threshold score, otherwise it returns None. first_match is (potentially) faster: it just returns the first geoname string that exceeds the threshold, or None if it can't find one, although if it doesn't find a match it still has to search the entire geoname list.

In my improved version, we generate the bigrams for each str1 once, rather than re-generating them for every str2 that we compare it with. And we compute all the geoname bigrams in advance, storing them in a dict indexed by the string so that we don't have to regenerate them for each str1. Also, we store the geoname bigrams as sets. That makes computing the hit_count much faster, since set membership testing is much faster than doing a linear scan over a list of strings. The geodict also needs to store the length of each bigram list: a set contains no duplicate items, so the length of the set of bigrams may be smaller than the length of the list of bigrams, but we need the list length to compute the score correctly.

# Some fake data
geonames = [
    'Slettmarkmountains Jotunheimen Norway',
    'Fairy Glen Skye Scotland UK',
    'Emigrant Wilderness California',
    'Yosemite National Park',
    'Half Dome Yosemite National Park',
]

mynames = [
    'Jotunheimen Norway',
    'Fairy Glen',
    'Slettmarkmountains Jotunheimen Norway',
    'Bryce Canyon',
    'Half Dome',
]

def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in range(len(s) - 1)]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union  = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                break
    return (2.0 * hit_count) / union

# Find the string in geonames which is the best match to str1
def best_match(str1, thresh=0.2):
    score, str2 = max((string_similarity(str1, str2), str2) for str2 in geonames)
    if score < thresh:
        str2 = None
    return score, str2

# Find the 1st string in geonames that matches str1 with a score >= thresh
def first_match(str1, thresh=0.2):
    for str2 in geonames:
        score = string_similarity(str1, str2)
        if score >= thresh:
            return score, str2
    return None

print('Best')
for mystr in mynames:
    print(mystr, ':', best_match(mystr))
print()

print('First')
for mystr in mynames:
    print(mystr, ':', first_match(mystr))
print()

# Put all the geoname bigrams into a dict
geodict = {}
for s in geonames:
    bigrams = get_bigrams(s)
    geodict[s] = (set(bigrams), len(bigrams))

def new_best_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)

    score, str2 = max((2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len), str2)
        for str2, (pairs2, pairs2_len) in geodict.items())
    if score < thresh:
        str2 = None
    return score, str2

def new_first_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)

    for str2, (pairs2, pairs2_len) in geodict.items():
        score = 2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len)
        if score >= thresh:
            return score, str2
    return None

print('New Best')
for mystr in mynames:
    print(mystr, ':', new_best_match(mystr))
print()

print('New First')
for mystr in mynames:
    print(mystr, ':', new_first_match(mystr))
print()

output

Best
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : (0.1875, None)
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

First
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : None
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

New Best
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : (0.1875, None)
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

New First
Jotunheimen Norway : (0.6415094339622641, 'Slettmarkmountains Jotunheimen Norway')
Fairy Glen : (0.5142857142857142, 'Fairy Glen Skye Scotland UK')
Slettmarkmountains Jotunheimen Norway : (1.0, 'Slettmarkmountains Jotunheimen Norway')
Bryce Canyon : None
Half Dome : (0.41025641025641024, 'Half Dome Yosemite National Park')

new_first_match is fairly straightforward. The line

for str2, (pairs2, pairs2_len) in geodict.items():

loops over every item in geodict, extracting each string, its bigram set, and the true bigram list length.

sum(x in pairs2 for x in pairs1)

counts how many of the bigrams in pairs1 are members of the pairs2 set.

So for each geoname string, we compute the similarity score and return it if it's >= the threshold, which has a default value of 0.2. You can give it a different default thresh, or pass a thresh when you call it.
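
For example, to demand a stronger match for a single lookup:

# Ask for a stronger match just for this call
result = new_first_match('Half Dome', thresh=0.35)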

new_best_match is a little more complicated. ;)

((2.0 * sum(x in pairs2 for x in pairs1) / (pairs1_len + pairs2_len), str2)
    for str2, (pairs2, pairs2_len) in geodict.items())

is a generator expression. It loops over the geodict items and creates a (score, str2) tuple for each geoname string. We then feed that generator expression to the max function, which returns the tuple with the highest score.


Here's a version of new_first_match that implements the suggestion that juvian made in the comments. It may save a little bit of time. This version also tests whether either bigram list is empty, which avoids a ZeroDivisionError.

def new_first_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    if not pairs1_len:
        return None

    hiscore = 0
    for str2, (pairs2, pairs2_len) in geodict.items():
        if not pairs2_len:
            continue
        total_len = pairs1_len + pairs2_len
        bound = 2.0 * pairs1_len / total_len
        if bound >= hiscore:
            score = 2.0 * sum(x in pairs2 for x in pairs1) / total_len
            if score >= thresh:
                return score, str2
            hiscore = max(hiscore, score)
    return None

A simpler variation is to not bother computing hiscore & just compare bound to thresh.
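
That simpler variation would look like this (an untested sketch, same structure as the version above):

def new_first_match(str1, thresh=0.2):
    pairs1 = get_bigrams(str1)
    pairs1_len = len(pairs1)
    if not pairs1_len:
        return None

    for str2, (pairs2, pairs2_len) in geodict.items():
        if not pairs2_len:
            continue
        total_len = pairs1_len + pairs2_len
        # Skip the expensive sum if even a perfect overlap couldn't reach thresh
        if 2.0 * pairs1_len / total_len < thresh:
            continue
        score = 2.0 * sum(x in pairs2 for x in pairs1) / total_len
        if score >= thresh:
            return score, str2
    return None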

PM 2Ring
  • This covers all the main things that I would have suggested, nice. A smaller optimization would be to avoid computing new_first_match if an upper bound of it (2 * len(pairs1) / (pairs1_len + pairs2_len)) does not reach current best thresh. – juvian Aug 27 '18 at 15:09
  • @juvian Ok, I've done that. But I'm not sure how much of a benefit it will be, since `sum(x in pairs2 for x in pairs1)` will be pretty fast unless `pairs1` is huge. – PM 2Ring Aug 27 '18 at 15:45
  • 1
    Well, if you process the dataset in order by length, there is a point where you can stop the search because all the following will have a lower upper bound (2 * pairs1_len is a constant, same as pairs1_len. The only thing increasing is pairs2_len, which will make for a lower upper bound). – juvian Aug 27 '18 at 15:54
  • Thanks for the updated version will try that ASAP. Forgot to upvote, so here it is BTW :) – LBes Aug 27 '18 at 16:07
  • Your new_best_match gives results in less than 10s, probably 5s on average, which is already approximately 6 times more efficient than what I had. However, your edited version of new_first_match gives me an error because of a division by 0 – LBes Aug 27 '18 at 16:19
  • @LBes Oh. A `ZeroDivisionError` can only happen there if `pairs1_len + pairs2_len` is zero, and that can only happen if both the bigrams are empty. And that can only happen if both the strings being compared have 1 or 0 characters. So you should filter such strings out of your data. FWIW, that same issue will affect `string_similarity` and all my match functions. – PM 2Ring Aug 27 '18 at 16:41
  • @PM2Ring ok I'll check it out, but I get very weird results with first_match anyways – LBes Aug 27 '18 at 16:49
  • @LBes I've added code to the last version of `new_first_match` that avoids the `ZeroDivisionError`. But it's best if you just make sure it doesn't get strings with empty bigrams in the first place. I guess the `first_match` functions can return weird matches if `thresh` is too low. With those functions you should make `thresh` as high as you can to eliminate low scoring matches, but not so high that it eliminates valid ones. That will probably take a bit of trial & error. – PM 2Ring Aug 27 '18 at 16:56

I used a SymSpell port to Python for spell checking. If you want to try processInput you will need to add the code for string_similarity; it's best to use PM 2Ring's adjustments to it.

from symspellpy.symspellpy import SymSpell, Verbosity  # import the module
import csv


geonames = [
    'Slettmarkmountains Jotunheimen Norway',
    'Fairy Glen Skye Scotland UK',
    'Emigrant Wilderness California',
    'Yosemite National Park',
    'Half Dome Yosemite National Park',
]

mynames = [
    'Jotuheimen Noway',
    'Fairy Gen',
    'Slettmarkmountains Jotnheimen Norway',
    'Bryce Canyon',
    'Half Domes',
]

frequency = {}
buckets = {}

def generateFrequencyDictionary():

    for geo in geonames:
        for word in geo.split(" "):
            if word not in frequency:
                frequency[word] = 0
            frequency[word] += 1


    with open("frequency.txt", "w") as f:
        w = csv.writer(f, delimiter = ' ',lineterminator='\r')
        w.writerows(frequency.items())      


def loadSpellChecker():
    global sym_spell
    initial_capacity = len(frequency)
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 4
    prefix_length = 7
    sym_spell = SymSpell(initial_capacity, max_edit_distance_dictionary,
                         prefix_length)
    # load dictionary
    dictionary_path = "frequency.txt"
    term_index = 0  # column of the term in the dictionary text file
    count_index = 1  # column of the term frequency in the dictionary text file
    if not sym_spell.load_dictionary(dictionary_path, term_index, count_index):
        print("Dictionary file not found")
        return

def splitGeoNamesIntoBuckets():
    for idx, geo in enumerate(geonames):
        for word in geo.split(" "):
            if word not in buckets:
                buckets[word] = set()
            buckets[word].add(idx)  


def string_similarity(str1, str2):
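    # Placeholder: plug in your own similarity function here (e.g. PM 2Ring's set-based version)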
    pass

def processInput():
    for name in mynames:
        toProcess = set()
        for word in name.split(" "):
            if word not in buckets: # fix our word with a spellcheck
                max_edit_distance_lookup = 4
                suggestion_verbosity = Verbosity.CLOSEST  # TOP, CLOSEST, ALL
                suggestions = sym_spell.lookup(word, suggestion_verbosity, max_edit_distance_lookup)
                if len(suggestions):
                    word = suggestions[0].term
            if word in buckets:
                toProcess.update(buckets[word])
        for index in toProcess: # process only sentences from related buckets
            string_similarity(name, geonames[index])                    



generateFrequencyDictionary()
loadSpellChecker()
splitGeoNamesIntoBuckets()
processInput()
juvian
  • Thanks for this. I'll have to try it out and see how it performs compared to the other solutions given below. Might be a bit difficult to adapt to my code to test it (I don't have a dictionary but a list, and not just a single string from geoname but rather multiple information [lat, long...]), but I'll try and let you know. Thanks – LBes Aug 27 '18 at 18:45
  • @LBes what dictionary? I used lists in the example. Geonames can be a list of a class or object with all your info, just need to change the code to access the name information of it. – juvian Aug 27 '18 at 18:52
  • my bad, I read too fast (was on the phone) and actually it seems to work almost right away. However, loadSpellChecker() seems to not work properly... It starts running the function but simply never stops and it seems that my computer is not doing anything – LBes Aug 27 '18 at 18:57
  • @LBes load_dictionary processing will probably take a long time as there are many words. Maybe use a smaller dataset to test or try decreasing prefix_length and max_edit_distance_dictionary. How big is frequency.txt? How many entries does frequency have? – juvian Aug 27 '18 at 19:06
  • I haven't had time to take a look at all that yet, but will make sure to do it ASAP. This is just a side project so I don't have as much time to work on it as I'd like :D – LBes Aug 27 '18 at 19:49
  • just getting back to you on that. Frequency is 21MB... It has approx. 1.5 million entries. So it takes a hell of a time to load. – LBes Aug 29 '18 at 14:21
  • @LBes there is a way to add them one by one, just to check how long it takes. Might also be worth checking how long the C++ version takes, as that one is the truly optimized one. Another option is to not add words with low frequency as they won't appear often anyway – juvian Aug 29 '18 at 14:45
  • indeed there are a couple of options. But I think that with that option and the other answer here, I'll be able to be much faster. I just need a few days to actually run some tests. Can't devote it much time now. Will get back to you – LBes Aug 29 '18 at 14:46