I'm working on an NLP project with a corpus of 180 million words. Before I begin training, I want to correct the spelling of the words using TextBlob's spell correction. Since TextBlob is slow to begin with, correcting 180 million words sequentially would take an unreasonably long time. So here is my approach (code will follow after this):
- Load the corpus
- Split the corpus into a list of sentences using the NLTK sentence tokenizer
- Multiprocessing: apply the spelling-correction function to every sentence in the list generated in step 2
Here is my code:
    import codecs
    import multiprocessing
    import nltk
    from textblob import TextBlob
    from nltk.tokenize import sent_tokenize

    class SpellCorrect():

        def __init__(self):
            pass

        def load_data(self, path):
            with codecs.open(path, "r", "utf-8") as file:
                data = file.read()
            return sent_tokenize(data)

        def correct_spelling(self, data):
            data = TextBlob(data)
            return str(data.correct())

        def merge_cleaned_corpus(self, result, path):
            result = " ".join(temp for temp in result)
            with codecs.open(path, "a", "utf-8") as file:
                file.write(result)

    if __name__ == "__main__":
        SpellCorrect = SpellCorrect()
        data = SpellCorrect.load_data(path)
        correct_spelling = SpellCorrect.correct_spelling
        pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
        result = pool.apply_async(correct_spelling, (data, ))
        result = result.get()
        SpellCorrect.merge_cleaned_corpus(tuple(result), path)
When I run this, I get the following error:
_pickle.PicklingError: Can't pickle <class '__main__.SpellCorrect'>: it's not the same object as __main__.SpellCorrect
The error is raised at the line result = result.get() in my code.
My (probably wrong) guess is that the parallel processing part completed successfully and applied my cleanup to every sentence, but that I'm failing to retrieve the results.
Can someone tell me why this error is being generated, and what I can do to fix it? Thanks in advance!