
I'm dealing with some NLP tasks. My inputs are French text, so only the Snowball stemmer is usable in my context. Unfortunately, it keeps giving me poor stems: it won't even remove the plural "s" or the silent "e". Below is an example:

from nltk.stem import SnowballStemmer
SnowballStemmer("french").stem("pommes, noisettes dorées & moelleuses, la boîte de 350g")
Output: 'pommes, noisettes dorées & moelleuses, la boîte de 350g'

1 Answer

Stemmers stem words, not sentences, so tokenize the sentence and stem each token individually.

>>> from nltk import word_tokenize
>>> from nltk.stem import SnowballStemmer

>>> fr = SnowballStemmer('french')

>>> sent = "pommes, noisettes dorées & moelleuses, la boîte de 350g"
>>> word_tokenize(sent)
['pommes', ',', 'noisettes', 'dorées', '&', 'moelleuses', ',', 'la', 'boîte', 'de', '350g']

>>> [fr.stem(word) for word in word_tokenize(sent)]
['pomm', ',', 'noiset', 'dor', '&', 'moelleux', ',', 'la', 'boît', 'de', '350g']

>>> ' '.join([fr.stem(word) for word in word_tokenize(sent)])
'pomm , noiset dor & moelleux , la boît de 350g'
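If downloading the punkt tokenizer model is inconvenient, a lightweight regex tokenizer gives the same token stream for input like this (a sketch, not a full replacement for `word_tokenize`; the pattern matches runs of word characters, including accented letters, or single punctuation marks):

```python
import re
from nltk.stem import SnowballStemmer

fr = SnowballStemmer('french')

def stem_text(text):
    """Tokenize with a simple regex and stem each token.

    \\w+ grabs words and numbers (accented letters included, since
    Python 3's \\w is Unicode-aware); [^\\w\\s] grabs each remaining
    punctuation character on its own.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return ' '.join(fr.stem(tok) for tok in tokens)

print(stem_text("pommes, noisettes dorées & moelleuses, la boîte de 350g"))
# → pomm , noiset dor & moelleux , la boît de 350g
```

The stemmer itself is purely algorithmic and needs no downloaded data; only `word_tokenize` depends on the punkt model.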
alvas
  • 115,346
  • 109
  • 446
  • 738
  • Thank you @alvas, understood! I was used to doing NLP in R; my conclusion is that there are very few vectorized solutions in the nltk package. – Neroksi Jul 01 '18 at 12:53
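As the comment notes, nltk has no vectorized API: the stemmer works one token at a time, so applying it over a corpus is usually done with plain comprehensions. A minimal sketch, assuming a small hypothetical list of documents named `docs`:

```python
from nltk.stem import SnowballStemmer

fr = SnowballStemmer('french')

# Hypothetical corpus; in practice `docs` would hold your own texts.
docs = ["pommes dorées", "noisettes moelleuses"]

# Map the stemmer over each token of each document.
stemmed = [' '.join(fr.stem(w) for w in doc.split()) for doc in docs]
print(stemmed)  # → ['pomm dor', 'noiset moelleux']
```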