
I have a big file with a list of words, some of which are misspelled, e.g. "moistuizer" for "moisturizer", "frm" for "from", "farrr" for "far", etc. I am using Python's SequenceMatcher to compare these words and find the correct one. Here is my code:

import itertools
from difflib import SequenceMatcher

for pair in list(itertools.combinations(index_list, 2)):
    s1 = stop_tag.term[pair[0]]
    s2 = stop_tag.term[pair[1]]
    m = SequenceMatcher(None, s1, s2)       # Similarity
    if m.ratio() > 0.90:
        pair_list.append(pair)

stop_tag is the dataframe and term is the name of the column. The problem is that the dataframe is pretty large, and when I run this, Python throws an error like this:

Traceback (most recent call last):
  File "C:/Users/admin/Pycharm/term_freq_v2.py", line 58, in <module>
    for i in list(itertools.combinations(index_list,2)):
MemoryError

Is it because difflib can't handle large amounts of data? I have read some posts here that worked with data sets larger than mine. Is there a workaround for this? Can I use another library instead of difflib? I am pretty new to Python, so I'm not really sure what this is. Any help would be appreciated.

M PAUL
    Nothing to do with `SequenceMatcher` - it's the fact you're trying to build a list in memory of *all* the combinations before you apply any filtering. Remove the `list(...)` and iterate over `itertools.combinations` directly instead. – Jon Clements Aug 02 '16 at 06:30
  • It works, thanks. I removed the `list(...)` and used `difflib.SequenceMatcher` instead of `SequenceMatcher`. – M PAUL Aug 02 '16 at 08:30
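Putting the comment's suggestion together, here is a minimal sketch of the memory-friendly version. The `terms` list below is a hypothetical stand-in for the `stop_tag.term` column; the key change is iterating over `itertools.combinations` directly, without wrapping it in `list(...)`:

```python
import itertools
from difflib import SequenceMatcher

# Hypothetical small word list standing in for stop_tag.term
terms = ["moisturizer", "moistuizer", "from", "frm", "far", "farrr"]
index_list = range(len(terms))

pair_list = []
# Iterate lazily over the generator: pairs are produced one at a
# time instead of all combinations being built in memory up front.
for i, j in itertools.combinations(index_list, 2):
    m = SequenceMatcher(None, terms[i], terms[j])
    if m.ratio() > 0.90:
        pair_list.append((i, j))

print(pair_list)
```

With these sample words, only "moisturizer"/"moistuizer" clears the 0.90 threshold (its ratio is about 0.95); "frm"/"from" and "farrr"/"far" score lower, so the cutoff may need tuning for shorter words.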
