FuzzyWuzzy search using Asian characters

Question

The code below, from a good Samaritan, works great in English: it can find a string of text in a large document and report a confidence score for how well it matches.

But I can't figure out how to get it working with Thai characters.

#!/usr/bin/python3

from difflib import SequenceMatcher as SM
from nltk.util import ngrams


# Read the document to search (haystack) and the text to find (needle).
with open('mainEN.txt', 'r', encoding='utf-8') as hay_file:
    hay = hay_file.read()

with open('searchEN.txt', 'r', encoding='utf-8') as needle_file:
    needle = needle_file.read()

needle_length  = len(needle.split())
max_sim_val    = 0
max_sim_string = ""

# Slide a window of whitespace-separated words (20% longer than the needle)
# over the haystack and keep the window most similar to the needle.
for ngram in ngrams(hay.split(), needle_length + int(.2*needle_length)):
    hay_ngram = " ".join(ngram)
    similarity = SM(None, hay_ngram, needle).ratio()
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram

print(max_sim_val, max_sim_string)
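
One likely reason the script struggles with Thai is that it builds its windows from whitespace-separated words, while Thai is normally written without spaces between words, so split() yields very few tokens. Here is a minimal sketch of a workaround, assuming Python 3 and text already decoded to Unicode: slide a window of characters instead of words. The function name best_char_window, the slack parameter, and the sample Thai strings are illustrative and not part of the original script.

```python
from difflib import SequenceMatcher


def best_char_window(hay, needle, slack=0.2):
    """Return (similarity, substring) for the hay window most similar to needle.

    Windows are measured in characters rather than whitespace-separated
    words, so the search does not depend on spaces between words.
    """
    window = len(needle) + int(slack * len(needle))
    # If the haystack is no longer than one window, compare it as a whole.
    if window == 0 or len(hay) <= window:
        return SequenceMatcher(None, hay, needle).ratio(), hay
    best_score, best_text = 0.0, ""
    for start in range(len(hay) - window + 1):
        candidate = hay[start:start + window]
        score = SequenceMatcher(None, candidate, needle).ratio()
        if score > best_score:
            best_score, best_text = score, candidate
    return best_score, best_text


# Hypothetical sample data: a Thai sentence with no spaces between words.
thai_hay = "ฉันชอบกินข้าวผัดกับไข่ดาวทุกวัน"
thai_needle = "ข้าวผัด"

# Number of whitespace-separated tokens the word-based approach would see.
print(len(thai_hay.split()))

score, match = best_char_window(thai_hay, thai_needle)
print(score, match)
```

This compares one window per character position, so it can be slow on a large document. Another option, not shown here, would be to run a Thai word segmenter (for example PyThaiNLP's word_tokenize) first and keep the original word n-gram approach.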

Thanks. Even some of the best translation engines still perform poorly, so that would not work IMHO — TinkyWinkyMD, Dec 12 '18 at 14:06
So you are answering your own question: if Google's translation itself is poor, think about how hard it would be for someone to write a package that can do fuzzywuzzy in Thai — Rahul Agarwal, Dec 12 '18 at 14:47
When you translate, active voice may change to passive, or the contextual meaning may be lost. But you need to do fuzzywuzzy, which means you only need to match particular words or get a partial match. So IMHO you should try the English translation and check how much accuracy you get — Rahul Agarwal, Dec 12 '18 at 14:49
Cheers. Does Fuzzy only do English? If so, then I understand. — TinkyWinkyMD, Dec 13 '18 at 16:39
Not sure! But most packages are written with the English language in mind, and support for certain other languages is then provided through language packages, for example the spaCy package. Hence I am not sure about fuzzywuzzy in Thai or any other language — Rahul Agarwal, Dec 13 '18 at 18:19
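
On the "English only?" point: the script in the question does its scoring with difflib.SequenceMatcher, which compares sequences of characters and has no language-specific rules. The small sketch below (Python 3 assumed; the helper name similarity and the sample words are made up for illustration) scores a near-match in English and in Thai the same way.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return the SequenceMatcher ratio (0.0 to 1.0) for two strings."""
    return SequenceMatcher(None, a, b).ratio()


# The same call works on any Unicode text; only the last character differs
# in each pair below.
print(similarity("fuzzy", "fuzzi"))
print(similarity("สวัสดี", "สวัสดิ"))
```

So the character comparison itself is not tied to English; the part of the script that assumes English-like text is the whitespace-based word splitting.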

0 Answers