
I am trying to find a faster way to lemmatize the words in a list (named `text`) using the NLTK WordNet Lemmatizer. Apparently this is the most time-consuming step in my whole program (I used cProfile to confirm this).

Here is the piece of code I am trying to optimize for speed:

from nltk.stem import WordNetLemmatizer

def lemmed(text):
    l = len(text)
    i = 0
    wnl = WordNetLemmatizer()
    while i < l:
        text[i] = wnl.lemmatize(text[i])
        i = i + 1
    return text

Using the lemmatizer slows my program down by 20x. Any help would be appreciated.

  • Decreases the performance of what by 20x? What do you need lemmatized forms for? – rmalouf Jun 24 '16 at 18:27
  • @rmalouf if I remove this function my program runs _20x_ faster. I need to pre-process data before running an algorithm on it. Hence the need for lemmatized forms. – Shivansh Singh Jun 24 '16 at 18:54
  • The reason I ask is that there are faster lemmatizers/stemmers than the WordNet one, but they also give different results. The answer is going to depend on exactly what your algorithm needs as input, and on how fast is fast enough for your application. It's hard to know how to answer a question like this without knowing the details about what the target is. – rmalouf Jun 24 '16 at 20:18
  • @rmalouf I would love to know about the faster lemmatizers. My input is a list of words OCR'ed from a document and I am looking to classify that document based on the words. I know the labels so it would fall under supervised learning if that helps in any way. – Shivansh Singh Jun 27 '16 at 15:56
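
(As a quick illustration of the faster alternatives rmalouf mentions above: NLTK's PorterStemmer does rule-based suffix stripping with no WordNet lookups, so it is much cheaper per word, but it produces stems rather than dictionary lemmas; whether that trade-off is acceptable depends on the downstream classifier. A minimal sketch:)

from nltk.stem import PorterStemmer

def stemmed(text):
    # Rule-based stemming: no dictionary lookups, so much faster,
    # but outputs are stems ("studi"), not lemmas ("study").
    ps = PorterStemmer()
    return [ps.stem(word) for word in text]

print(stemmed(['studies', 'friends', 'running']))
# => ['studi', 'friend', 'run']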

1 Answer


If you have a few cores to spare, try using the multiprocessing library:

from nltk.stem import WordNetLemmatizer
from multiprocessing import Pool

def lemmed(text, cores=6):  # tweak cores as needed
    with Pool(processes=cores) as pool:
        wnl = WordNetLemmatizer()
        # map the lemmatizer directly over the list, split across workers
        result = pool.map(wnl.lemmatize, text)
    return result


sample_text = ['tests', 'friends', 'hello'] * (10 ** 6)

lemmed_text = lemmed(sample_text)

assert len(sample_text) == len(lemmed_text) == (10 ** 6) * 3

print(lemmed_text[:3])
# => ['test', 'friend', 'hello']
Alec
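
(One knob the answer doesn't use, which may matter here since each lemmatize call is tiny compared with the cost of shuttling items between processes: Pool.map takes an optional chunksize argument, and larger chunks reduce the inter-process overhead. A sketch; the function name and chunk value are guesses, not benchmarked figures:)

from multiprocessing import Pool
from nltk.stem import WordNetLemmatizer

def lemmed_chunked(text, cores=6, chunk=10000):
    with Pool(processes=cores) as pool:
        wnl = WordNetLemmatizer()
        # Bigger chunks mean fewer, larger transfers to the workers,
        # which helps when the per-item work is this small.
        return pool.map(wnl.lemmatize, text, chunksize=chunk)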
  • Thank you for this amazing answer. I got a speedup of 4x _(using 6 cores)_, but what is more baffling is that the speedup is **~10x using cores=1**. I am trying different numbers of cores, but seemingly the time pattern is 8 > 6 > 4 > 2 > 1 core, which makes no sense to me. Can you possibly explain why this pattern occurs, and how did using _Pool_ change this? – Shivansh Singh Jun 24 '16 at 20:08
  • There's usually a performance cost to context switching as you increase the number of cores or threads, and in this case execution will block until the slowest worker has finished. You'll probably see different results depending on the size of the array going in, too. Things always get murkier once you start using threads or multiple workers, which is why it always helps to save it until you really need it! – Alec Jun 24 '16 at 22:01
  • As for the speed-up, it's possible that the direct mapping provided an improvement over the existing variable assignment. As long as it works, may as well go with it! – Alec Jun 24 '16 at 22:06
  • Thanks for the explanation @alecrasmussen. I would +1 this answer but I do not seem to have enough reputation points to do that yet. – Shivansh Singh Jun 27 '16 at 15:53
  • It took 76 seconds on average to lemmatize a text of about 50 words using the multiprocessing Pool, but it was faster without it (2 seconds). – PinkBanter Jul 11 '19 at 07:47
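
(Pulling the thread together: the Pool only pays off once the input is large enough to amortize worker startup and inter-process transfer, which is why the 50-word text above was far faster without it. OCR'ed word lists also tend to repeat words, so caching lemmas for words already seen helps in either mode. A minimal sketch of both ideas; the threshold value and names are assumptions, not measured figures:)

from functools import lru_cache
from multiprocessing import Pool
from nltk.stem import WordNetLemmatizer

_wnl = WordNetLemmatizer()

@lru_cache(maxsize=None)
def _lemma(word):
    # OCR'ed documents repeat words constantly; cache each lookup.
    return _wnl.lemmatize(word)

def lemmed_adaptive(text, cores=6, threshold=100000):
    if len(text) < threshold:
        # Small inputs: process startup costs more than it saves.
        return [_lemma(w) for w in text]
    with Pool(processes=cores) as pool:
        # Note: each worker keeps its own cache; nothing is shared.
        return pool.map(_lemma, text)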