Stemming process not working in Python

Question

I have a text file that I am trying to stem after having removed stopwords but it seems that nothing changes when I run it. My file is called data0.

Here is my code:

## Removing stopwords and tokenizing by words (split each word)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

data0 = word_tokenize(data0)
data0 = ' '.join([word for word in data0 if word not in (stopwords.words('english'))])
print(data0)

## Stemming the data
from nltk.stem import PorterStemmer

ps = PorterStemmer()
data0 = ps.stem(data0)
print(data0)

And I get the following (wrapped for legibility):

For us around Aberdeen , question `` What oil industry ? ( Evening Express , October 26 ) touch deja vu . That question asked almost since day first drop oil pumped North Sea . In past 30 years seen constant cycle ups downs , booms busts industry . I predict happen next . There period worry uncertainty scrabble find something keep local economy buoyant oil gone . Then upturn see jobs investment oil , everyone breathe sigh relief quest diversify go back burner . That downfall . Major industries prone collapse . Look nation 's defunct shipyards extinct coal steel industries . That 's vital n't panic downturns , start planning sensibly future . Our civic business leaders need constantly looking something secure prosperity - tourism , technology , bio-science emerging industries . We need economically strong rather waiting see happens oil roller coaster hits buffers . N JonesEllon

The first part of the code (removing stopwords and tokenizing) works fine, but the second part (stemming) does not. Any idea why?

Could you update your question with the output you get. What 'does not work' about it? — Tom Rees, Apr 01 '16 at 11:06

Tom Rees · Accepted Answer · 2016-04-01T12:44:56.190

From the Stemmer docs http://www.nltk.org/howto/stem.html, it looks like the Stemmer is designed to be called on a single word at a time.

Try running it on each word in

[word for word in data0 if word not in (stopwords.words('english'))]

That is, stem each word before calling join. For example:

# data0 is still the token list here (before the join)
stemmed_list = []
for word in [w for w in data0 if w not in stopwords.words('english')]:
    stemmed_list.append(ps.stem(word))
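
The idea above can be sketched as a small helper: drop the stopwords, stem each remaining token individually, and only then join. This is a minimal sketch, not the asker's exact pipeline. The name `stem_without_stopwords` and the tiny hand-written token list and stopword set are illustrative assumptions; in the question's setting you would pass the output of word_tokenize and stopwords.words('english') instead. Building the stopword set once also avoids re-evaluating stopwords.words('english') for every token.

```python
from nltk.stem import PorterStemmer


def stem_without_stopwords(tokens, stop_words, stemmer):
    """Drop stopwords, stem the remaining tokens one by one, then join."""
    stop_set = set(stop_words)  # build once for fast membership tests
    kept = [tok for tok in tokens if tok not in stop_set]
    return " ".join(stemmer.stem(tok) for tok in kept)


# Illustrative input only: stand-ins for word_tokenize(data0)
# and stopwords.words('english').
tokens = ["the", "industries", "need", "to", "diversify"]
demo_stop_words = {"the", "to"}

ps = PorterStemmer()
result = stem_without_stopwords(tokens, demo_stop_words, ps)
print(result)
```

The key difference from the code in the question is that ps.stem sees one token at a time, so the suffix rules are applied to every word rather than only to the end of one long string.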

Edit (in response to the comment): I ran the following, and it seemed to stem correctly:

>>> from nltk.stem import PorterStemmer
>>> ps = PorterStemmer()
>>> data0 = '''<Your Data0 string>'''
>>> words = data0.split(" ")
>>> stemmed_words = map(ps.stem, words)
>>> print(list(stemmed_words))  # list cast needed because of 'map'
[..., 'industri', ..., 'diversifi']

I don't think there is a stemmer that can be applied straight to text, but you can wrap it in your own function that takes 'ps' and the text:

def my_stem(text, stemmer):
    words = text.split(" ")
    # Pass the bound method: the stemmer object itself is not callable
    stemmed_words = map(stemmer.stem, words)
    result = " ".join(stemmed_words)
    return result
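
One limitation of splitting on single spaces is that punctuation stays attached to a word unless the text was tokenized first, and a token such as "industries," is unlikely to be stemmed as intended. A possible alternative, sketched here with the standard `re` module rather than anything from the answer above, is to stem each alphabetic run in place and leave spacing and punctuation untouched. The name `stem_text` and the sample sentence are hypothetical, chosen only for illustration.

```python
import re

from nltk.stem import PorterStemmer


def stem_text(text, stemmer):
    """Stem every alphabetic run in text, leaving everything else as is."""
    # Each regex match is a single word, which is what stemmer.stem expects.
    return re.sub(r"[A-Za-z]+", lambda m: stemmer.stem(m.group(0)), text)


ps = PorterStemmer()
stemmed = stem_text("Major industries, prone to collapse.", ps)
print(stemmed)
```

A pattern this simple splits contractions at the apostrophe, so treat it as a starting point rather than a replacement for a proper tokenizer.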

@Tom Rees, the code looks great, but it still seems to give me the same outcome. Since the PorterStemmer is designed to be called on a single word at a time, are there other stemmer functions that can be applied to whole texts? Thanks — Economist_Ayahuasca, Apr 01 '16 at 12:26
@AndresAzqueta, I've edited the answer in response to your comment. Hope that helps. — Tom Rees, Apr 01 '16 at 12:45

score 1 · Answer 2 · answered Apr 01 '16 at 12:48

Here's what I've done in the past w/NLTK:

import nltk
from nltk.stem import PorterStemmer

st = PorterStemmer()

def stem_tokens(tokens):
    # Stem lazily, one token at a time
    for item in tokens:
        yield st.stem(item)

def go(text):
    tokens = nltk.word_tokenize(text)
    return ' '.join(stem_tokens(tokens))

Brian Cain
