
I am trying to write a function that reads a txt file and processes it word by word: tokenizing, removing whitespace, removing stop words, stemming, and collecting word counts. Something seems wrong with the stemming, though, since some of the "s"s and "r"s are swallowed by the program. Also, which part is the appropriate place to insert the word counts?

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize #split variable into words
from nltk.corpus import stopwords #stopwords
from nltk.stem import PorterStemmer #stem tools
from collections import defaultdict

#1)
def tokenizers(filename):
    #Read the whole file (readline() would only return the first line)
    file = open(filename, "r", encoding="utf-8")
    lines = file.read()
    file.close()
    #Set stop words and punctuation symbols to filter out
    stopWords = set(stopwords.words("english"))
    stopWords = stopWords.union({",", "(", ")", "[", "]", "{", "}", "#", "@", "!", ":", ";", ".", "?"})
    #Tokenize the text into words
    words = word_tokenize(lines)
    #Stem the words; the Porter stemmer returns stems rather than dictionary
    #words, so trailing letters such as "s" or "r" can get cut off
    ps = PorterStemmer()
    filterWords = [ps.stem(w) for w in words if w.lower() not in stopWords]
    return filterWords
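
For reference, a call to the function might look like the minimal sketch below; "sample.txt" is only a placeholder filename, and the NLTK "punkt" and "stopwords" resources are assumed to be downloaded already.

# Hypothetical usage; "sample.txt" is a placeholder file name and the NLTK
# data is assumed to be available (e.g. nltk.download("punkt"), nltk.download("stopwords")).
words = tokenizers("sample.txt")
print(words[:10])   # first ten filtered, stemmed tokens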
lily
  • As far as I know, stemmers are not perfect and they can create non-existing words - so your result is normal. You should try lemmatizers. Lemmatizers should work better, but stemmers are faster (see the sketch after this list). [Stemming and Lemmatization in Python](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python) – furas Oct 17 '19 at 17:41
  • To count words you can use `counts = collections.Counter(filterWords)`. Then you can get the two most common words with `counts.most_common(2)` (see the sketch after this list). – furas Oct 17 '19 at 17:47
  • Perhaps this https://stackoverflow.com/a/50689970/610569 ? – alvas Oct 18 '19 at 01:56
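
Following the lemmatizer suggestion in the comments, a minimal sketch (assuming the NLTK "wordnet" data has been downloaded, e.g. via nltk.download("wordnet")):

from nltk.stem import WordNetLemmatizer

# The WordNet lemmatizer returns dictionary words instead of truncated stems;
# by default it treats every token as a noun.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w) for w in ["flowers", "leaves", "geese"]])
# ['flower', 'leaf', 'goose']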
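
And for the word counts mentioned in the comments, Counter can be applied to the list that tokenizers() returns; the token list below is only a toy example so the snippet runs on its own.

from collections import Counter

# filterWords stands in for the list returned by tokenizers()
filterWords = ["flower", "sun", "flower", "rain", "sun", "flower"]
counts = Counter(filterWords)
print(counts.most_common(2))   # [('flower', 3), ('sun', 2)]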

0 Answers