How to exclude certain names and terms from stemming (Python NLTK SnowballStemmer (Porter2))

Question

I am new to NLP, to Python, and to posting on Stack Overflow all at once, so please be patient with me if I seem ignorant :).

I am using SnowballStemmer from Python's NLTK to stem words for textual analysis. While lemmatization seems to understem my tokens, the Snowball (Porter2) stemmer, which I have read is generally preferred to the original Porter stemmer, overstems them. I am analyzing tweets that include many names, and probably also places and other words that should not be stemmed, such as hillary, hannity, and president, which are now reduced to hillari, hanniti, and presid (you have probably guessed already whose tweets I am analyzing).

Is there an easy way to exclude certain terms from stemming? Alternatively, I could only lemmatize the tokens and add a rule for common suffixes like -ed, -s, …. Another idea might be to stem only verbs and adjectives, as well as nouns ending in s. That might also be close enough…

This is the code I am using at the moment:

# LEMMATIZE AND STEM WORDS

import nltk
from nltk.stem.snowball import EnglishStemmer

lemmatizer = nltk.stem.WordNetLemmatizer()
snowball = EnglishStemmer()

def lemmatize_text(text):
    # text is a list of tokens; lemmatize each one
    return [lemmatizer.lemmatize(w) for w in text]

def snowball_stemmer(text):
    # text is a list of tokens; stem each one
    return [snowball.stem(w) for w in text]

# APPLY FUNCTIONS

tweets['text_snowball'] = tweets.text_processed.apply(snowball_stemmer)
tweets['text_lemma'] = tweets.text_processed.apply(lemmatize_text)

I hope someone can help… Contrary to my past experience with all kinds of issues, I have not been able to find adequate help for my issue online so far.

Thanks!

score 2 · Accepted Answer · answered Dec 10 '19 at 12:04

Are you familiar with NER (named entity recognition)? You can preprocess your text to locate all named entities and then exclude them from stemming. After stemming, you can merge the data again.
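The exclude-and-merge step could look roughly like the sketch below. It assumes the named entities have already been collected into a set, whether by an NER tool or by hand. The names `stem_except` and `protected` are hypothetical, and `stem` stands for any stemming function, such as the `stem` method of the question's `snowball` object. Protected tokens are skipped in place, so no separate merge is needed afterwards.

```python
from typing import Callable, Iterable, List, Set


def stem_except(tokens: Iterable[str],
                stem: Callable[[str], str],
                protected: Set[str]) -> List[str]:
    """Stem every token except those in `protected` (hypothetical helper)."""
    # Compare case-insensitively so "Hillary" and "hillary" are both kept.
    protected_lower = {p.lower() for p in protected}
    # Protected tokens pass through unchanged; everything else is stemmed.
    return [tok if tok.lower() in protected_lower else stem(tok)
            for tok in tokens]
```

With the question's setup, this might be applied as `tweets.text_processed.apply(lambda toks: stem_except(toks, snowball.stem, protected))`, where `protected` is whatever set of names and terms should be left alone.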

CLpragmatics


I did not, but will look into it. Thank you! – ylimenibor Dec 10 '19 at 12:32
You're welcome. If you have problems, either comment again here or ask a new question. – CLpragmatics Dec 10 '19 at 13:24
