
I have two Excel files, each with the columns [product name, price, quantity]. One is the reference table (drug_list), which contains the correct product names (14,000 items); the other file (order-1) contains 3,000 product names with many typos. I need to look up each product in order-1 against the reference file (drug_list), but the typos in order-1's product names make this hard. I first tried fuzzywuzzy in Python to match the two columns, but it has been running for more than 16 hours and still hasn't finished. Is there a faster way?

from fuzzywuzzy import process

for prod in drug_list['product']:
    matches = process.extract(prod, order_1['product'], limit=order_1.shape[0])

    for best_match in matches:
        if best_match[1] >= 85:  # fuzzywuzzy scores range from 0 to 100, not 0 to 1
            order_1.loc[order_1['product'] == best_match[0], 'product'] = prod

1 Answer


You could try rapidfuzz for this, similar to the solution I described in https://stackoverflow.com/a/73973651/11335032. process.cdist computes a complete matrix of all pairwise scores. In your case this only requires around 14k * 3k * 4 bytes ≈ 160 MB of memory. If you work with larger datasets, you might need to process smaller chunks as described in the referenced answer. For your problem, the following should work:

from rapidfuzz import process

scores = process.cdist(
    drug_list['product'], order_1['product'],
    score_cutoff=85, workers=-1)

Note that by default process.cdist uses scorer=fuzz.ratio and processor=None. To get the same results as your process.extract-based solution, use:

from rapidfuzz import process, fuzz, utils

scores = process.cdist(
    drug_list['product'], order_1['product'],
    scorer=fuzz.WRatio, processor=utils.default_process,
    score_cutoff=85, workers=-1)

This is slower, but depending on your dataset it may produce better results.

maxbachmann