
I am using the Python NLTK library to perform stemming on a big corpus. I am doing the following:

text = [porter.stem(token) for token in text.split()] 
text = ' '.join(text)

"text" is representing one row of my file. I have millions of rows in my file, and this process is taking huge amount of time. I just want to ask is there any better method to do this operation?

Sangeeta
  • Can you give more information? The only thing I can say about your code is that maybe `text = ' '.join(porter.stem(token) for token in text.split())` could be a bit faster, but it won't give a big increase in efficiency. Without the whole loop we can't say much more than "millions of lines is a big quantity of data which implies big processing times". – Bakuriu Dec 10 '12 at 19:54
  • Hi! Thanks for the reply. What information do you need? – Sangeeta Dec 10 '12 at 20:01
  • 1
  • You say that `text` is just a line and you are processing millions of lines, so could you show how the whole thing is done? Yes, it's true that optimizing the parts inside a loop generally pays off more (because they are executed more times), but in this case you simply can't do much better, so you should try optimizing the other parts of the loop. – Bakuriu Dec 10 '12 at 20:06

1 Answer


How many is "millions" and how long is a "huge amount of time"? Porter stemming isn't a complicated algorithm and should be reasonably quick. I suspect you're I/O limited rather than anything else. Still... there may be some improvements you can eke out.
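If you want to confirm that before optimizing, a rough sketch like this (assuming the corpus is a plain text file; `corpus.txt` is just a placeholder path) compares a read-and-split pass with a read-split-and-stem pass:

import time
from nltk.stem import PorterStemmer

porter = PorterStemmer()

# Pass 1: read and tokenize only, to measure I/O and splitting cost
start = time.perf_counter()
with open('corpus.txt') as f:
    for line in f:
        line.split()
read_time = time.perf_counter() - start

# Pass 2: read, tokenize, and stem every token
start = time.perf_counter()
with open('corpus.txt') as f:
    for line in f:
        for token in line.split():
            porter.stem(token)
stem_time = time.perf_counter() - start

print('read+split: %.1fs  read+split+stem: %.1fs' % (read_time, stem_time))

If the two timings are close, the stemmer isn't the bottleneck and caching won't buy you much.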

If order is not important and you don't need every copy of each stem, you may find it simpler (and more memory-efficient) to use a dictionary and/or set to store your stems. This lets you avoid stemming words you've already seen, which should improve performance, and stores each stem only once.

For example:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
seenwords = set()
seenstems = set()

for line in input_file:  # input_file is your open corpus file
    line = line.lower().split()
    # union() returns a new set; update() adds the new stems in place
    seenstems.update(porter.stem(token) for token in line if token not in seenwords)
    seenwords.update(line)

This may stem a word more than once if it appears several times on the same line, but words seen on earlier lines never need to be stemmed again. You could also process the words one by one (a cached, order-preserving variant is sketched below), which avoids stemming duplicates within a line, but there's some speed advantage in using the generator expression rather than an explicit for loop.
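If you do need to keep the stemmed text line by line in its original order (as in the question), a memoized word-by-word variant along the following lines should give the same caching benefit; `stem_cache` and `stem_line` are just illustrative names, not part of NLTK:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
stem_cache = {}  # maps each word to its stem so no word is stemmed twice

def stem_line(line):
    stemmed = []
    for token in line.lower().split():
        if token not in stem_cache:
            stem_cache[token] = porter.stem(token)
        stemmed.append(stem_cache[token])
    return ' '.join(stemmed)

You would then call `text = stem_line(text)` in place of the two lines from the question.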

kindall