I have a text document I need to use stemming and lemmatization on. I have already cleaned the data, tokenised it, and removed stop words.

What I need to do is take the token list as an input and return a dict, where the keys are 'original', 'stem' and 'lemma', and the values are the nth word transformed in that way.

The Snowball stemmer is defined as Stemmer() and the WordNetLemmatizer as lemmatizer().
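
For example, given the token list ['said', 'talked', 'walked'] with n selecting 'walked', I'd expect something like:

{'original': 'walked', 'stem': 'walk', 'lemma': 'walked'}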

Here's the code I've written, but it gives an error:

def find_roots(token_list, n):
    n = 2
    original = tokens
    stem = [ele for sub in original for idx, ele in
            enumerate(sub.split()) if idx == (n - 1)]
    stem = stemmer(stem)
    lemma = [ele for sub in original for idx, ele in
             enumerate(sub.split()) if idx == (n - 1)]
    lemma = lemmatizer()
    return

Any help would be appreciated

Retsukki
2 Answers


I really don't understand what you are trying to do in the list comprehensions, so I'll just write how I would do it:

from nltk import WordNetLemmatizer, SnowballStemmer

lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")


def find_roots(token_list, n):
    # Grab the nth token and return it alongside its stem and lemma.
    token = token_list[n]
    stem = stemmer.stem(token)
    lemma = lemmatizer.lemmatize(token)
    return {"original": token, "stem": stem, "lemma": lemma}


roots_dict = find_roots(["said", "talked", "walked"], n=2)
print(roots_dict)
> {'original': 'walked', 'stem': 'walk', 'lemma': 'walked'}
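
Note that WordNetLemmatizer treats every word as a noun by default, which is why 'walked' comes back unchanged while the stemmer reduces it to 'walk'. If you know the part of speech you can pass it explicitly; a minimal sketch:

from nltk import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Default pos="n" leaves the past-tense verb untouched...
print(lemmatizer.lemmatize("walked"))           # walked
# ...while pos="v" reduces it to its base form.
print(lemmatizer.lemmatize("walked", pos="v"))  # walk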
ewz93

You can do what you want with spaCy as shown below (in many cases spaCy performs better than NLTK):

# $ pip install -U spacy
# $ python -m spacy download en_core_web_sm

import spacy
from nltk import WordNetLemmatizer, SnowballStemmer

sp = spacy.load('en_core_web_sm')
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")


words = ['compute', 'computer', 'computed', 'computing', 'said', 'talked', 'walked']
for word in words:
    print(f'Original Word : {word}')
    print(f'Stemmer with nltk : {stemmer.stem(word)}')
    print(f'Lemmatization with nltk : {lemmatizer.lemmatize(word)}')

    sp_word = sp(word)
    print(f'Lemmatization with spacy : {sp_word[0].lemma_}')

Output:

Original Word : compute
Stemmer with nltk : comput
Lemmatization with nltk : compute
Lemmatization with spacy : compute
Original Word : computer
Stemmer with nltk : comput
Lemmatization with nltk : computer
Lemmatization with spacy : computer
Original Word : computed
Stemmer with nltk : comput
Lemmatization with nltk : computed
Lemmatization with spacy : compute
Original Word : computing
Stemmer with nltk : comput
Lemmatization with nltk : computing
Lemmatization with spacy : compute
Original Word : said
Stemmer with nltk : said
Lemmatization with nltk : said
Lemmatization with spacy : say
Original Word : talked
Stemmer with nltk : talk
Lemmatization with nltk : talked
Lemmatization with spacy : talk
Original Word : walked
Stemmer with nltk : walk
Lemmatization with nltk : walked
Lemmatization with spacy : walk
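
Also note that spaCy's lemmatizer uses the part-of-speech tags it assigns, so feeding it one word at a time as above throws that context away. A minimal sketch of lemmatizing a whole token list in one pass, assuming the same en_core_web_sm model:

import spacy

sp = spacy.load('en_core_web_sm')

# Re-join the pre-tokenized words so spaCy can tag them in context.
doc = sp(' '.join(['he', 'talked', 'and', 'walked']))
for token in doc:
    print(token.text, '->', token.lemma_)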
I'mahdi