I have a pandas DataFrame that contains two columns named `Potential Word` and `Fixed Word`. The `Potential Word` column contains words from different languages, both misspelled and correctly spelled, and the `Fixed Word` column contains the correct word corresponding to each `Potential Word`.
Below I have shared some sample data:
| Potential Word | Fixed Word |
|---|---|
| Exemple | Example |
| pipol | People |
| pimple | Pimple |
| Iunik | unique |
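For reproducibility, here is a minimal sketch that builds the sample vocab above as a DataFrame (the real `df` is loaded from my 600K-row source, so this tiny frame is only for illustration):

```python
import pandas as pd

# Tiny stand-in for the real 600K-row vocab, for illustration only.
df = pd.DataFrame({
    'Potential Word': ['Exemple', 'pipol', 'pimple', 'Iunik'],
    'Fixed Word': ['Example', 'People', 'Pimple', 'unique'],
})
```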
My vocab DataFrame contains 600K unique rows.
My Solution:
```python
key = given_word  # given_word is the misspelled word to fix
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65

for each in df['Potential Word']:
    # match is a function that returns a similarity value of two strings
    match_value = match(each, key)
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each
```
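To make the snippet runnable end to end, here is the same loop wrapped as a function, with a `difflib`-based stand-in for `match` (my real `match` implementation may differ; this version is only to show the shape of the computation):

```python
import difflib

import pandas as pd

def match(a: str, b: str) -> float:
    # Stand-in similarity measure in [0.0, 1.0]; the real match() may differ.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fix_word(given_word: str, df: pd.DataFrame, match_threshold: float = 0.65) -> str:
    """Return the closest 'Potential Word' above the threshold, or ''."""
    glob_match_value = 0.0
    potential_fixed_word = ''
    for each in df['Potential Word']:
        match_value = match(each, given_word)
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
    return potential_fixed_word

print(fix_word('pippol', df))  # -> 'pipol' with the sample df above
```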
Problem
The problem with my code is that it takes a lot of time to fix every word, because the loop has to run through the whole 600K-row vocab for each word. When a word is missing from the vocab, it takes almost 5 or 6 seconds to process a sentence of 10~12 words. The match function itself performs decently, so it is not the target of the optimization.
I need an optimized solution. Can anyone help me here?