I am trying to topic model a list of descriptions using LSA. When I tokenize the descriptions and then build a vocab from them, the vocab contains single letters rather than words.
import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer

my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'
custom_stopwords = ['author', 'book', 'books', 'story', 'stories', 'novel', 'series', 'collection', 'edition', 'volume', 'readers', 'reader', 'reprint', 'writer', 'writing']
final_stopword_list = custom_stopwords + my_stopwords
# cleaning master function
def clean_tokens(tokens):
    tokens = tokens.lower()  # lower case
    tokens = re.sub('[' + my_punctuation + ']+', ' ', tokens)  # strip punctuation
    tokens = re.sub('([0-9]+)', '', tokens)  # remove numbers
    token_list = [word for word in tokens.split(' ') if word not in final_stopword_list]  # remove stopwords
    tokens = ' '.join(token_list)
    return tokens
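To show concretely what the function hands back, here is a self-contained demo of the same cleaning steps. I substitute a tiny stand-in stopword list (`demo_stopwords`) for `final_stopword_list`, since the NLTK list is long; everything else mirrors the function above. The output is a single joined string, and iterating over a string yields characters, not words.

```python
import re

my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'
demo_stopwords = ['a', 'book']  # stand-in for final_stopword_list

def clean_tokens(tokens):
    tokens = tokens.lower()  # lower case
    tokens = re.sub('[' + my_punctuation + ']+', ' ', tokens)  # strip punctuation
    tokens = re.sub('([0-9]+)', '', tokens)  # remove numbers
    token_list = [word for word in tokens.split(' ') if word not in demo_stopwords]
    tokens = ' '.join(token_list)  # joins the tokens back into ONE string
    return tokens

out = clean_tokens("A Great Book, 2nd printing!")
print(type(out))      # <class 'str'>
print(list(out)[:3])  # iterating a string gives single characters
```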
This is how I pass it as the tokenizer:
count_vectoriser = CountVectorizer(tokenizer=clean_tokens)
bag_of_words = count_vectoriser.fit_transform(df.Description)
vocab = count_vectoriser.get_feature_names_out()
print(vocab[:10])
And this is the vocab it returns:
[' ' '#' '\\' 'a' 'b' 'c' 'd' 'e' 'f' 'g']
when I want it to give me words. I am tokenizing a column from a pandas DataFrame, so I don't know if that is altering the way I am tokenizing.
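For what it's worth, my current understanding (which may be wrong) is that `CountVectorizer` treats whatever `tokenizer` returns as an iterable of tokens, so because `clean_tokens` joins everything back into one string, the vectorizer iterates it character by character. A sketch of the change I think is needed, returning the list directly (again using a stand-in `demo_stopwords` list in place of `final_stopword_list` so it runs on its own):

```python
import re

my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'
demo_stopwords = ['a', 'book']  # stand-in for final_stopword_list

def clean_tokens(tokens):
    tokens = tokens.lower()  # lower case
    tokens = re.sub('[' + my_punctuation + ']+', ' ', tokens)  # strip punctuation
    tokens = re.sub('([0-9]+)', '', tokens)  # remove numbers
    # return the token LIST itself; do not join it back into a string
    return [word for word in tokens.split(' ')
            if word and word not in demo_stopwords]

tokens = clean_tokens("A Great Book, 2nd printing!")
print(tokens)
```

The extra `if word` filter drops the empty strings that `split(' ')` produces where punctuation was replaced by spaces.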