
I've created a small program that checks if authors are present in a database of authors. I haven't been able to find any specific modules for this problem, so I'm writing it from scratch using modules for approximate string matching.

The database contains around 6000 authors and is very poorly formatted (many typos, variations, titles such as "Dr.", etc.). A query author list is usually between 500-1000 names (and I have many of these lists), making speed quite important.

My general strategy is to trim and filter the database as much as possible and look for exact matches. If no matches are found, I move on to approximate string matching.

I'm currently using the built-in difflib.get_close_matches, which does exactly what I want; however, it is extremely slow (several minutes). Therefore, I am looking for other options:

  • What is the fastest module that can return the best, say, 3 matches above some threshold in a database, given a query string?
  • What's the fastest module for comparing two strings?

The only one I have found is fuzzywuzzy, which is even slower than difflib.
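
For reference, this is roughly what my current two-stage lookup looks like (a simplified sketch; the normalization step and the 0.8 cutoff are just placeholders):

import difflib

def normalize(name):
    """Lowercase, strip whitespace and a few common titles."""
    name = name.lower().strip()
    for title in ("dr.", "prof.", "mr.", "mrs."):
        name = name.replace(title, "")
    return name.strip()

def find_author(query, db_names, db_index):
    q = normalize(query)
    # Stage 1: exact match against the normalized database names.
    if q in db_index:
        return [db_index[q]]
    # Stage 2: fall back to approximate matching (this is the slow part).
    return difflib.get_close_matches(q, db_names, n=3, cutoff=0.8)

db = ["Smith, J.", "Dr. Jones, A.", "Miller, K."]
db_names = [normalize(n) for n in db]
db_index = dict(zip(db_names, db))
print(find_author("Jones, A.", db_names, db_index))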

Anthon
Misconstruction

  • Did you try [editdist](http://pypi.python.org/pypi/editdist/0.1) and/or [Fast String Comparator](http://pypi.python.org/pypi/Fast%20String%20Comparator/1.0)? They are implementations of the [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance). – Scharron Dec 21 '12 at 10:21
  • This sounds like a Levenshtein distance problem, but the difflib implementation most likely uses this or a similar algorithm already. You might use a hash on all entries and only compare those whose hashes are sufficiently close, but then it all depends on finding a good hash algorithm (if there is one!). Maybe the sum of the ASCII values of all letters would work as a very simple one. – hochl Dec 21 '12 at 10:22
  • My experience with matching a single name against about 10k names using the Levenshtein package in Python is that it's rather fast. Could you outline what's taking several minutes? Is it matching the 500-1000 "unknown" names against the 6000 "known" names? Or just one of the "unknown" names against the "known" ones? – Mike Sandford Dec 23 '12 at 00:41
  • @hochl - I think you're right that a good strategy would be to make an initial rough sorting of the huge list. I have been looking into simply using SequenceMatcher.real_quick_ratio first to filter out the worst matches (a rough sketch of this idea follows these comments). @Mike Sandford - the database list contains around 6,000 names. A list of query names usually contains 500-1000 names. For each of the names in the query list, I wish to find the best matching name in the database list (above some threshold). I'm going to look into the Levenshtein package to see if it is faster. – Misconstruction Dec 28 '12 at 08:45
  • Maybe you can also make other kinds of optimizations: does the "query names list" contain duplicates (not necessarily in the same file)? In that case you could use some kind of caching. Can you make some assumptions (e.g. the first char of each query name is correct) in order to sort/filter the list of 6000 author names? Can you improve the list of author names (e.g. by removing titles or replacing non-ASCII chars)? – furins Dec 28 '12 at 11:24
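
A rough sketch of the pre-filter idea from the comments above, using SequenceMatcher's cheap upper bounds (real_quick_ratio and quick_ratio) to discard hopeless candidates before computing the full ratio; the 0.6 cutoff and sample names are purely illustrative:

from difflib import SequenceMatcher

def best_matches(query, candidates, cutoff=0.6, n=3):
    """Cheap upper-bound filters first, full ratio() only on survivors."""
    sm = SequenceMatcher()
    sm.set_seq2(query)  # SequenceMatcher caches info about seq2, so set it once
    scored = []
    for cand in candidates:
        sm.set_seq1(cand)
        # real_quick_ratio() and quick_ratio() are upper bounds on ratio(),
        # so anything below the cutoff here can never pass it.
        if sm.real_quick_ratio() < cutoff or sm.quick_ratio() < cutoff:
            continue
        score = sm.ratio()
        if score >= cutoff:
            scored.append((score, cand))
    return sorted(scored, reverse=True)[:n]

print(best_matches("lemaire", ["lemair", "le maire", "smith", "pfeil"]))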

2 Answers


Try fuzzywuzzy with the native-C python-levenshtein lib installed.

I ran a benchmark on my PC, finding the best candidates for each of 8 words within a ~19k-word list, with and without the C-native Levenshtein backend installed (using pip install python_Levenshtein-0.12.0-cp34-none-win_amd64.whl), and I got these timings:

  • No C-backend:
    Compared 151664 words in 48.591717004776 sec (0.00032039058052521366 sec/search).
  • C-backend installed:
    Compared 151664 words in 13.034106969833374 sec (8.594067787895198e-05 sec/search).

That is roughly 4x faster (though not as much as I expected).

Here are the results:

0 of 8: Compared 'Lemaire' --> `[('L.', 90), ('Le', 90), ('A', 90), ('Re', 90), ('Em', 90)]`
1 of 8: Compared 'Peil' --> `[('L.', 90), ('E.', 90), ('Pfeil', 89), ('Gampel', 76), ('Jo-pei', 76)]`
2 of 8: Compared 'Singleton' --> `[('Eto', 90), ('Ng', 90), ('Le', 90), ('to', 90), ('On', 90)]`
3 of 8: Compared 'Tagoe' --> `[('Go', 90), ('A', 90), ('T', 90), ('E.', 90), ('Sagoe', 80)]`
4 of 8: Compared 'Jgoun' --> `[('Go', 90), ('Gon', 75), ('Journo', 73), ('Jaguin', 73), ('Gounaris', 72)]`
5 of 8: Compared 'Ben' --> `[('Benfer', 90), ('Bence', 90), ('Ben-Amotz', 90), ('Beniaminov', 90), ('Benczak', 90)]`
6 of 8: Compared 'Porte' --> `[('Porter', 91), ('Portet', 91), ('Porten', 91), ('Po', 90), ('Gould-Porter', 90)]`
7 of 8: Compared 'Nyla' --> `[('L.', 90), ('A', 90), ('Sirichanya', 76), ('Neyland', 73), ('Greenleaf', 67)]`

And here is the python-code of the benchmark:

import os
import zipfile
from urllib import request as urlrequest
from fuzzywuzzy import process as fzproc
import time
import random

download_url = 'http://www.outpost9.com/files/wordlists/actor-surname.zip'
zip_name = os.path.basename(download_url)
fname, _ = os.path.splitext(zip_name)

def fuzzy_match(dictionary, search):
    nsearch = len(search)
    for i, s in enumerate(search):
        best = fzproc.extractBests(s, dictionary)
        print("%i of %i: Compared '%s' --> `%s`" % (i, nsearch, s, best))

def benchmark_fuzzy_match(wordslist, dict_split_ratio=0.9996):
    """ Shuffle and split words-list into `dictionary` and `search-words`. """
    rnd = random.Random(0)
    rnd.shuffle(wordslist)
    nwords = len(wordslist)
    ndictionary = int(dict_split_ratio * nwords)

    dictionary = wordslist[:ndictionary]
    search = wordslist[ndictionary:]
    fuzzy_match(dictionary, search)

    return ndictionary, (nwords - ndictionary)

def run_benchmark():
    if not os.path.exists(zip_name):
        urlrequest.urlretrieve(download_url, filename=zip_name)

    with zipfile.ZipFile(zip_name, 'r') as zfile:
        with zfile.open(fname) as words_file:
            blines = words_file.readlines()
            wordslist = [line.decode('ascii').strip() for line in blines]
            wordslist = wordslist[4:]  # Skip header.

            t_start = time.time()
            ndict, nsearch = benchmark_fuzzy_match(wordslist)
            t_finish = time.time()

            t_elapsed = t_finish - t_start
            ncomparisons = ndict * nsearch
            sec_per_search = t_elapsed / ncomparisons
            msg = "Compared %s words in %s sec (%s sec/search)."
            print(msg % (ncomparisons, t_elapsed, sec_per_search))

if __name__ == '__main__':
    run_benchmark()
ankostis

Python's Natural Language Toolkit (nltk) might have some additional resources you could try out - this Google Groups thread seems like a good start on that. Just an idea.
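
For example, nltk ships a plain Levenshtein implementation that is easy to try (illustrative snippet; the 3-edit threshold and names are arbitrary):

from nltk.metrics.distance import edit_distance

query = "Lemaire"
authors = ["Lemair", "Lamaire", "Smith", "Le Maire"]

# Keep only candidates within 3 edits of the query, best first.
scored = sorted((edit_distance(query.lower(), a.lower()), a) for a in authors)
print([pair for pair in scored if pair[0] <= 3])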

jimf
  • So I finally got back to solving this problem. In the end I used the jellyfish module, which has a nice implementation of the Jaro-Winkler distance that produces results just as good as get_close_matches and is faster. Then I used pprocess to run the comparison against the database in parallel, and now comparing 5000 authors to the database takes around 10 seconds! – Misconstruction Feb 22 '13 at 16:23
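
A minimal sketch of that final setup, using jellyfish's Jaro-Winkler score and the standard-library multiprocessing module in place of pprocess (newer jellyfish releases call the function jaro_winkler_similarity, older ones jaro_winkler; the 0.9 threshold and sample names are illustrative):

import multiprocessing
import jellyfish

DATABASE = ["lemaire", "pfeil", "singleton", "porter", "neyland"]
THRESHOLD = 0.9  # illustrative cutoff

# Handle both old and new jellyfish APIs.
try:
    jaro_winkler = jellyfish.jaro_winkler_similarity
except AttributeError:
    jaro_winkler = jellyfish.jaro_winkler

def best_match(query):
    """Return (query, best db name or None, score) for a single query name."""
    score, name = max((jaro_winkler(query, db), db) for db in DATABASE)
    return (query, name if score >= THRESHOLD else None, score)

if __name__ == "__main__":
    queries = ["lemaire", "porte", "nyla"]
    with multiprocessing.Pool() as pool:
        for query, name, score in pool.map(best_match, queries):
            print(query, "->", name, round(score, 3))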