
I'm dealing with some NLP tasks. My inputs are French text, so only the Snowball stemmer is usable in my context. Unfortunately, it keeps giving me poor stems: it won't even remove the plural "s" or the silent "e". Below is an example:

from nltk.stem import SnowballStemmer
SnowballStemmer("french").stem("pommes, noisettes dorées & moelleuses, la boîte de 350g")
Output: 'pommes, noisettes dorées & moelleuses, la boîte de 350g'

1 Answer

Stemmers stem words, not sentences, so tokenize the sentence and stem each token individually.

>>> from nltk import word_tokenize
>>> from nltk.stem import SnowballStemmer

>>> fr = SnowballStemmer('french')

>>> sent = "pommes, noisettes dorées & moelleuses, la boîte de 350g"
>>> word_tokenize(sent)
['pommes', ',', 'noisettes', 'dorées', '&', 'moelleuses', ',', 'la', 'boîte', 'de', '350g']

>>> [fr.stem(word) for word in word_tokenize(sent)]
['pomm', ',', 'noiset', 'dor', '&', 'moelleux', ',', 'la', 'boît', 'de', '350g']

>>> ' '.join([fr.stem(word) for word in word_tokenize(sent)])
'pomm , noiset dor & moelleux , la boît de 350g'
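If downloading the punkt tokenizer model is inconvenient, a lightweight regex tokenizer gives the same token stream for input like this (a sketch, not a full replacement for `word_tokenize`; the pattern matches runs of word characters, including accented letters, or single punctuation marks):

```python
import re
from nltk.stem import SnowballStemmer

fr = SnowballStemmer('french')

def stem_text(text):
    """Tokenize with a simple regex and stem each token.

    \\w+ grabs words and numbers (accented letters included, since
    Python 3's \\w is Unicode-aware); [^\\w\\s] grabs each remaining
    punctuation character on its own.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return ' '.join(fr.stem(tok) for tok in tokens)

print(stem_text("pommes, noisettes dorées & moelleuses, la boîte de 350g"))
# → pomm , noiset dor & moelleux , la boît de 350g
```

The stemmer itself is purely algorithmic and needs no downloaded data; only `word_tokenize` depends on the punkt model.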
alvas
  • 115,346
  • 109
  • 446
  • 738
  • Thank you @alvas, understood! I was used to doing NLP in R; my conclusion is that there are very few vectorized solutions in the nltk package. – Neroksi Jul 01 '18 at 12:53
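As the comment notes, nltk has no vectorized API: the stemmer works one token at a time, so applying it over a corpus is usually done with plain comprehensions. A minimal sketch, assuming a small hypothetical list of documents named `docs`:

```python
from nltk.stem import SnowballStemmer

fr = SnowballStemmer('french')

# Hypothetical corpus; in practice `docs` would hold your own texts.
docs = ["pommes dorées", "noisettes moelleuses"]

# Map the stemmer over each token of each document.
stemmed = [' '.join(fr.stem(w) for w in doc.split()) for doc in docs]
print(stemmed)  # → ['pomm dor', 'noiset moelleux']
```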