
I made the following function to clean the text notes of my dataset:

import spacy

nlp = spacy.load("en")

def clean(text):
    """
    Text preprocessing for English text.
    """
    # Apply spaCy to the text
    doc = nlp(text)
    # Lemmatization and noise removal (stopwords, digits, punctuation)
    tokens = [token.lemma_.strip() for token in doc if
              not token.is_stop and not nlp.vocab[token.lemma_].is_stop  # remove stopwords
              and not token.is_punct  # remove punctuation
              and not token.is_digit  # remove digits
              ]
    # Rebuild the text from the remaining tokens
    text = " ".join(tokens)

    return text.lower()

The problem is that when I want to clean all the text in my dataset, it takes hours (my dataset has 70k rows, with between 100 and 5,000 words per row).

I tried to use swifter to run the apply method on multiple threads, like this: data.note_line_comment.swifter.apply(clean)

But it didn't really improve things, as it still took almost an hour.

I was wondering if there is any way to write a vectorized form of my function, or maybe another way to speed up the process. Any ideas?

Yohann L.
  • What are you using for lemmatization? Is it your custom written lemmatizer? – Alexander Rossa Apr 03 '19 at 15:37
  • @AlexanderRossa Yes, I use spaCy and just remove stopwords, digits, punctuation and single characters, then I join the tokens back – Yohann L. Apr 03 '19 at 15:38
  • 2
    This problem takes a while by definition. Lemmatizing and stemmizing words is costly, and actually working with strings in general is costly. The idea is that you only need to do the cleaning *once*, such that from that point on you just load the already cleaned data. So even if it takes a whole day to do it, that's alright. If you absolutely need performance or speedup the timing in your cleaning, you might consider some parallel processing. I don't know you stack and the infrastructure you work in, but you might consider splitting the data and running in different computers simultaneouslyetc – rafaelc Apr 03 '19 at 17:08
  • 1
    @RafaelC thanks for your answer, very interesting. By the meantime I've found out that you can disable some pipeline component in SpaCy. By using `nlp = spacy.load("en", disable=['ner', 'parser', 'tagger', 'textcat'])` it took now only 20min, and as you said I don't need to do this job every time. – Yohann L. Apr 03 '19 at 17:25
  • Nice to hear that! Your data is small (~70k rows) so that helps a lot too ;} – rafaelc Apr 03 '19 at 17:26

1 Answer


Short answer

This type of problem inherently takes time.

Long answer

  • Use regular expressions
  • Change the spacy pipeline

The more information about the strings you need in order to decide what to keep, the longer it will take.

The good news is that if your text cleaning is relatively simple, a few regular expressions might do the trick.
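For instance, if you can live without lemmatization, a minimal sketch along these lines might be enough (clean_regex is a hypothetical name, and it borrows spaCy's English stopword list instead of running the pipeline):

import re
from spacy.lang.en.stop_words import STOP_WORDS

def clean_regex(text):
    # Lowercase, keep only letters and whitespace, then drop stopwords
    # and single characters. Note: no lemmatization here, so the output
    # will differ from the spaCy-based clean() above.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS and len(t) > 1]
    return " ".join(tokens)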

Otherwise, you are using the spaCy pipeline to help remove bits of text, which is costly since it does many things by default:

  1. Tokenisation
  2. Lemmatisation
  3. Dependency parsing
  4. NER
  5. Chunking

Alternatively, you can try your task again with the parts of the spaCy pipeline you don't need turned off, which may speed it up quite a bit.

For example, maybe turn off named entity recognition, tagging and dependency parsing...

nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

Then try again; it should speed up.
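As a rough sketch, this is how the reduced pipeline could be combined with the clean() function and DataFrame from the question (the note_clean column name is just a placeholder):

import spacy

# Load only the components needed for tokenization and lemma lookup;
# skip dependency parsing, tagging and NER
nlp = spacy.load("en", disable=["parser", "tagger", "ner"])

# Reuse the clean() function and the data DataFrame from the question
data["note_clean"] = data.note_line_comment.apply(clean)
# or, with multiple workers: data.note_line_comment.swifter.apply(clean)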

Nathan McCoy