I have a big file with a list of words, some of which are not spelled correctly, e.g. "moistuizer" for "moisturizer", "frm" for "from", "farrr" for "far", etc. I am using SequenceMatcher from Python's difflib to compare these words and find the correct one. Here is my code:
for pair in list(itertools.combinations(index_list, 2)):
    s1 = stop_tag.term[pair[0]]
    s2 = stop_tag.term[pair[1]]
    m = SequenceMatcher(None, s1, s2)  # similarity
    if m.ratio() > 0.90:
        pair_list.append(pair)
stop_tag is the dataframe and term is the name of the column.
The problem is that the dataframe is pretty large, and when I run this, Python throws me an error like this:
Traceback (most recent call last):
File "C:/Users/admin/Pycharm/term_freq_v2.py", line 58, in <module>
for i in list(itertools.combinations(index_list,2)):
MemoryError
Is it because difflib can't handle a large amount of data? I have read some posts here that worked with data sets larger than mine. Is there a workaround for this? Can I use another library instead of difflib? I am pretty new to Python, so I'm not really sure what this is. Any help would be appreciated.
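In case it helps, here is a stripped-down version of one idea I was wondering about: iterating over the combinations generator directly instead of building the whole list with list(...) first. The sample words are made up; I'm not sure this is the right fix.

```python
import itertools
from difflib import SequenceMatcher

# Made-up sample words standing in for my real column of terms
terms = ["moisturizer", "moistuizer", "from", "frm", "far", "farrr"]

pair_list = []
# Iterate lazily: itertools.combinations yields one pair at a time,
# so the full set of pairs is never held in memory at once.
for s1, s2 in itertools.combinations(terms, 2):
    if SequenceMatcher(None, s1, s2).ratio() > 0.90:
        pair_list.append((s1, s2))

print(pair_list)
```

With my real data, pair_list itself could of course still grow large if many pairs pass the threshold, so I don't know if this alone would be enough.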