I wrote the following function to clean the text notes in my dataset:
import spacy

nlp = spacy.load("en_core_web_sm")

def clean(text):
    """
    Text preprocessing for English text.
    """
    # Run the spaCy pipeline on the text
    doc = nlp(text)
    # Lemmatize and drop noise tokens (stopwords, punctuation, digits)
    tokens = [token.lemma_.strip() for token in doc
              if not token.is_stop
              and not nlp.vocab[token.lemma_].is_stop  # also drop tokens whose lemma is a stopword
              and not token.is_punct  # remove punctuation
              and not token.is_digit  # remove digits
              ]
    # Rebuild the text from the kept tokens
    text = " ".join(tokens)
    return text.lower()
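On a single note it does what I want; for example (the exact lemmas depend on the model version):

clean("The 3 cats were running quickly!")
# -> 'cat run quickly'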
The problem is that cleaning the whole dataset takes hours (the dataset has about 70k rows, with 100 to 5000 words per row).
I tried using swifter to run the apply method in parallel, like this: data.note_line_comment.swifter.apply(clean). But it didn't really help, as it still took almost an hour.
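For reference, the full swifter call looks like this (importing swifter is what registers the .swifter accessor on pandas objects):

import swifter  # noqa: F401 -- registers the .swifter accessor on pandas Series/DataFrames

data["note_line_comment"] = data.note_line_comment.swifter.apply(clean)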
I was wondering whether there is a way to vectorize my function, or some other way to speed up the process. Any ideas?
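For example, I imagine something along these lines with nlp.pipe, which streams the texts through the pipeline in batches instead of making one nlp(text) call per row. This is an untested sketch: the batch_size value is a guess, and the disable list assumes the parser and NER are not needed for the lemmas and token flags I use.

import spacy

# The cleaning logic only needs lemmas and token flags, so (I assume)
# the parser and NER components can be skipped
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def clean_all(texts):
    cleaned = []
    # Batch the texts through the pipeline instead of calling nlp() per row
    for doc in nlp.pipe(texts, batch_size=100):
        tokens = [token.lemma_.strip() for token in doc
                  if not token.is_stop
                  and not nlp.vocab[token.lemma_].is_stop
                  and not token.is_punct
                  and not token.is_digit]
        cleaned.append(" ".join(tokens).lower())
    return cleaned

data["note_line_comment"] = clean_all(data["note_line_comment"].tolist())

If I understand the docs correctly, recent spaCy versions also accept an n_process argument to nlp.pipe to spread the batches across cores, but I have not tried that.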