
Assume I have a string text = "A compiler translates code from a source language". I want to do two things:

  1. I need to iterate through each word and stem using the NLTK library. The function for stemming is PorterStemmer().stem_word(word). We have to pass the argument 'word'. How can I stem each word and get back the stemmed sentence?

  2. I need to remove certain stop words from the text string. The list of stop words is stored in a text file (space separated), which I read like this:

    stopwordsfile = open('c:/stopwordlist.txt','r+')
    stopwordslist=stopwordsfile.read()
    

    How can I remove those stop words from text and get a cleaned new string?

MERose
ChamingaD
  • `for word in text.split(' '): stemmer.stem_word(word)`? – wkl May 08 '12 at 20:12
  • stemmed = for word in text.split(' '): stemmer.stem_word(word) will this work ? – ChamingaD May 08 '12 at 20:15
  • Not exactly. If you want a list of the stems, you could do `stemmed = [stemmer.stem_word(w) for w in text.split(' ')]`. If you want a sentence of it, you can then do `sente = ' '.join(stemmed)`, which will return a sentence of all the stems. Let me know if that helps. – wkl May 08 '12 at 20:18
  • @birryree Thanks :) I did it with " ".join(PorterStemmer().stem_word(word) for word in text.split(" ")) – ChamingaD May 08 '12 at 20:20

2 Answers


I posted this as a comment, but thought I might as well flesh it out into a full answer with some explanation:

You want to use str.split() to split the string into words, and then stem each word:

for word in text.split(" "):
    PorterStemmer().stem_word(word)

As you want to get a string of all the stemmed words together, it's trivial to then join these stems back together. To do this easily and efficiently we use str.join() and a generator expression:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))
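As a rough sketch of the same idea wrapped in a reusable function: the stemmer is passed in as a callable, so it is created once instead of once per word. The names `stem_sentence` and `toy_stem` are made up for illustration, and `toy_stem` is only a stand-in so the snippet runs without NLTK installed. It is not the Porter algorithm.

```python
def stem_sentence(text, stem):
    """Stem every whitespace-separated word in `text` and rejoin with spaces.

    `stem` is any callable that maps one word to its stem, for example the
    bound stemming method of an NLTK PorterStemmer instance.
    """
    return " ".join(stem(word) for word in text.split())


def toy_stem(word):
    """Hypothetical stand-in stemmer: lowercases and drops one trailing 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


text = "A compiler translates code from a source language"
print(stem_sentence(text, toy_stem))
```

With NLTK available you would pass the real stemmer instead, e.g. `stem_sentence(text, PorterStemmer().stem)`. Note that recent NLTK releases name the method `stem()`; `stem_word()` comes from older releases and may not exist in yours.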

Edit:

For your other problem:

with open("/path/to/file.txt") as f:
    words = set(f.read().split())

Here we open the file using the with statement (which is the best way to open files, as it handles closing them correctly, even on exceptions, and is more readable) and read the contents into a set. We use a set as we don't care about the order of the words, or duplicates, and it will be more efficient later. Calling str.split() with no arguments splits on any run of whitespace, so this works whether the stop words are space separated or listed one per line, and it avoids the trailing newlines that iterating over the file directly (set(f)) would leave on every word. If they are comma separated, split on "," instead and strip the surrounding whitespace from each word.

stems = (PorterStemmer().stem_word(word) for word in text.split(" "))
" ".join(stem for stem in stems if stem not in words)

Here we use the if clause of a generator expression to skip any stem that is in the set of words we loaded from the file. Membership checks on a set are O(1) on average, so this should be relatively efficient.
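If you want the stop word removal as its own step, something along these lines should do. This is a sketch under a few assumptions: the file is whitespace separated (as in the question), matching is case-insensitive (my choice, so that "A" is treated like "a"), and the function names `load_stopwords` and `remove_stopwords` are invented for the example.

```python
def load_stopwords(path):
    """Read a whitespace-separated stop word file into a lowercase set."""
    with open(path) as f:
        return {word.lower() for word in f.read().split()}


def remove_stopwords(text, stopwords):
    """Return `text` without the words found in `stopwords` (case-insensitive)."""
    return " ".join(word for word in text.split() if word.lower() not in stopwords)


# Inline set for demonstration; in practice use load_stopwords() on your file.
stopwords = {"a", "from"}
text = "A compiler translates code from a source language"
cleaned = remove_stopwords(text, stopwords)
print(cleaned)
```

Because this filters the original words rather than the stems, you can run it before stemming and compare the result with filtering afterwards.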

Edit 2:

To remove the words before they are stemmed, it's even simpler:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" ") if word not in words)

The removal of the given words is simply:

filtered_words = [word for word in unfiltered_words if word not in set_of_words_to_filter]
Gareth Latty
  • I need to do another thing. To remove stop words from that string. List of Stopwords stored in text file (space separated) stopwordsfile = open('c:/stopwordlist.txt','r+') stopwordslist=stopwordsfile.read() I need to remove those stop words from `text` and get cleaned new string. – ChamingaD May 08 '12 at 20:27
  • @ChamingaD I would suggest that is a different problem and you should open a new question. If you do that, it will be more helpful to other people in future with a similar problem, and easier for us to work with. – Gareth Latty May 08 '12 at 20:29
  • Problem is I have to wait another 20 minutes to start a new question :/ – ChamingaD May 08 '12 at 20:31
  • @ChamingaD I've added an answer here for this case. In future, however, posting a separate question is the better solution. – Gareth Latty May 08 '12 at 20:37
  • Thanks a lot :) can I get stopword removal as separate code ? (first i'll remove stopwords then stemming) – ChamingaD May 08 '12 at 20:40
  • @ChamingaD added, in future, you might want to be clearer about what you want. – Gareth Latty May 08 '12 at 20:43
  • Yea, Seems my explanation is poor :/. I need two separate code segments for removing stopwords and stemming (two for loops separated). Because I have to check how each technique affect final output ;) – ChamingaD May 08 '12 at 20:51
  • filtered_words = [word for word in unfiltered_words if not in set_of_words_to_filter] .. In this code unfiltered_words = text and set_of_words_to_filter = stopwordslist ? My stop word list is space separated so will this work ? – ChamingaD May 08 '12 at 21:40

To go through each word in the string:

for word in text.split():
    PorterStemmer().stem_word(word)

Use the string's join method (recommended by Lattyware) to concatenate the pieces into one string:

" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))
  • The question does ask 'and get a stemmed sentence back' so a full answer would be `" ".join(PorterStemmer().stem_word(word) for word in text.split(" "))`. – Gareth Latty May 08 '12 at 20:15