
I am trying to write a function that reads a txt file and processes it word by word: tokenizing, removing whitespace, removing stop words, stemming, and collecting word counts. Something seems wrong with the stemming, though, since some of the "s"s and "r"s are swallowed by the program. Also, which part is the appropriate place to insert the word counts?

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize #split variable into words
from nltk.corpus import stopwords #stopwords
from nltk.stem import PorterStemmer #stem tools
from collections import defaultdict

#1)
def tokenizers(filename):
    #Read the whole file (readline() would only return the first line)
    file = open(filename, "r", encoding="utf-8")
    lines = file.read()
    file.close()
    #Set stop words and punctuation symbols to filter out
    stopWords = set(stopwords.words("english"))
    stopWords = stopWords.union({",", "(", ")", "[", "]", "{", "}", "#", "@", "!", ":", ";", ".", "?"})
    #Tokenize the text into words
    words = word_tokenize(lines)
    #Stem the words; the Porter stemmer returns stems rather than dictionary
    #words, so trailing letters such as "s" or "r" can get cut off
    ps = PorterStemmer()
    filterWords = [ps.stem(w) for w in words if w.lower() not in stopWords]
    return filterWords
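
For reference, a call to the function might look like the minimal sketch below; "sample.txt" is only a placeholder filename, and the NLTK "punkt" and "stopwords" resources are assumed to be downloaded already.

# Hypothetical usage; "sample.txt" is a placeholder file name and the NLTK
# data is assumed to be available (e.g. nltk.download("punkt"), nltk.download("stopwords")).
words = tokenizers("sample.txt")
print(words[:10])   # first ten filtered, stemmed tokens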
lily
  • As far as I know, stemmers are not perfect and they can create non-existing words - so your result is normal. You should try lemmatizers. Lemmatizers should work better, but stemmers are faster (see the sketch after this list). [Stemming and Lemmatization in Python](https://www.datacamp.com/community/tutorials/stemming-lemmatization-python) – furas Oct 17 '19 at 17:41
  • To count words you can use `counts = collections.Counter(filterWords)`. Then you can get the two most common words with `counts.most_common(2)` (see the sketch after this list). – furas Oct 17 '19 at 17:47
  • Perhaps this https://stackoverflow.com/a/50689970/610569 ? – alvas Oct 18 '19 at 01:56
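
Following the lemmatizer suggestion in the comments, a minimal sketch (assuming the NLTK "wordnet" data has been downloaded, e.g. via nltk.download("wordnet")):

from nltk.stem import WordNetLemmatizer

# The WordNet lemmatizer returns dictionary words instead of truncated stems;
# by default it treats every token as a noun.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w) for w in ["flowers", "leaves", "geese"]])
# ['flower', 'leaf', 'goose']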
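
And for the word counts mentioned in the comments, Counter can be applied to the list that tokenizers() returns; the token list below is only a toy example so the snippet runs on its own.

from collections import Counter

# filterWords stands in for the list returned by tokenizers()
filterWords = ["flower", "sun", "flower", "rain", "sun", "flower"]
counts = Counter(filterWords)
print(counts.most_common(2))   # [('flower', 3), ('sun', 2)]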

0 Answers