I'm working on an NLP project with a corpus of 180 million words. Before I begin training, I want to correct the spelling of the words using TextBlob's spell correction. Since TextBlob is slow to begin with, correcting 180 million words sequentially would take an unreasonably long time. So here is my approach (code will follow after this):
- Load the corpus
- Split the corpus into a list of sentences using the NLTK sentence tokenizer
- Multiprocessing: apply the spelling-correction function to every sentence in the list generated in step 2
Here is my code:
    import codecs
    import multiprocessing
    import nltk
    from textblob import TextBlob
    from nltk.tokenize import sent_tokenize

    class SpellCorrect():

        def __init__(self):
            pass

        def load_data(self, path):
            with codecs.open(path, "r", "utf-8") as file:
                data = file.read()
            return sent_tokenize(data)

        def correct_spelling(self, data):
            data = TextBlob(data)
            return str(data.correct())

        def merge_cleaned_corpus(self, result, path):
            result = " ".join(temp for temp in result)
            with codecs.open(path, "a", "utf-8") as file:
                file.write(result)

    if __name__ == "__main__":
        SpellCorrect = SpellCorrect()
        data = SpellCorrect.load_data(path)
        correct_spelling = SpellCorrect.correct_spelling
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        result = pool.apply_async(correct_spelling, (data, ))
        result = result.get()
        SpellCorrect.merge_cleaned_corpus(tuple(result), path)
When I run this, I get the following error:
_pickle.PicklingError: Can't pickle <class '__main__.SpellCorrect'>: it's not the same object as __main__.SpellCorrect
The error is raised at the line result = result.get() in my code.
My (probably wrong) guess is that the parallel processing part completed successfully and applied my cleanup to every sentence, but that I'm failing to retrieve the results.
Can someone tell me why this error is being generated, and what I can do to fix it? Thanks in advance!