What should be the outcome of stemming a word with an apostrophe?

Question

I'm using nltk.stem.porter.PorterStemmer in Python to get the stems of words.

When I stem "women" and "women's", I get different results: "women" and "women'", respectively. For my purposes, both words need to have the same stem.

As I see it, both words refer to the same concept, and one is just an inflected form of the other, so they should share a stem.

Why am I getting two different results? Is this correct?

I think you should first tokenize your text removing all punctuation. — giograno, Jan 27 '16 at 09:56

score 3 · Accepted Answer · answered Jan 27 '16 at 15:31

You need to tokenize your text before lemmatizing, so that the possessive "'s" is split off into its own token. The examples below use WordNetLemmatizer.

Without tokenization:

>>> from nltk import word_tokenize
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()

>>> [wnl.lemmatize(i) for i in "the woman's going home".split()]
['the', "woman's", 'going', 'home']
>>> [wnl.lemmatize(i) for i in "the women's home is in London".split()]
['the', "women's", 'home', 'is', 'in', 'London']

With tokenization:

>>> [wnl.lemmatize(i) for i in word_tokenize("the woman's going home")]
['the', 'woman', "'s", 'going', 'home']
>>> [wnl.lemmatize(i) for i in word_tokenize("the women's home is in London")]
['the', u'woman', "'s", 'home', 'is', 'in', 'London']
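If you want to stay with a stemmer rather than a lemmatizer, the same idea applies: remove the possessive marker before stemming. The following is only a minimal sketch and not part of NLTK. The helper names `strip_possessive` and `stem_tokens` are hypothetical, and it uses a simple regular expression in place of `word_tokenize`, so it assumes whitespace-separated input and will also strip the "'s" from contractions such as "it's".

```python
import re

# Trailing possessive marker: 's or a bare apostrophe (straight or curly quote).
_POSSESSIVE = re.compile(r"(?:['\u2019]s|['\u2019])$", re.IGNORECASE)


def strip_possessive(token):
    """Return the token without a trailing 's or ' (hypothetical helper)."""
    return _POSSESSIVE.sub("", token)


def stem_tokens(text, stem):
    """Split text on whitespace, drop possessive markers, then apply `stem`.

    `stem` is any callable mapping a word to its stem,
    for example the `stem` method of a stemmer object.
    """
    return [stem(strip_possessive(tok)) for tok in text.split()]
```

With NLTK installed, passing `PorterStemmer().stem` as the `stem` argument should then give "women" and "women's" the same stem, since both reach the stemmer as "women".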
