
I am trying to topic model a list of descriptions using LSA. When I tokenize the descriptions and then build a vocab from them, the vocab comes back as single letters rather than words.

import re
import nltk
from sklearn.feature_extraction.text import CountVectorizer

my_stopwords = nltk.corpus.stopwords.words('english')
word_rooter = nltk.stem.snowball.PorterStemmer(ignore_stopwords=False).stem
my_punctuation = '!"$%&\'()*+,-./:;<=>?[\\]^_`{|}~•@'
custom_stopwords = ['author', 'book', 'books', 'story', 'stories', 'novel', 'series', 'collection', 'edition', 'volume', 'readers', 'reader', 'reprint', 'writer', 'writing'] 

final_stopword_list = custom_stopwords + my_stopwords

# cleaning master function
def clean_tokens(tokens):
    tokens = tokens.lower() # lower case
    tokens = re.sub('['+my_punctuation + ']+', ' ', tokens) # strip punctuation
    tokens = re.sub('([0-9]+)', '', tokens) # remove numbers
    token_list = [word for word in tokens.split(' ') if word not in final_stopword_list] # remove stopwords
    tokens = ' '.join(token_list)
    return tokens

This is my tokenizer:

count_vectoriser = CountVectorizer(tokenizer=clean_tokens)
bag_of_words = count_vectoriser.fit_transform(df.Description)
vocab = count_vectoriser.get_feature_names_out()
print(vocab[:10]) 

And this is the vocab it returns:

[' ' '#' '\\' 'a' 'b' 'c' 'd' 'e' 'f' 'g']

when I want it to give me words.

I am tokenizing text from a pandas DataFrame, so I don't know whether that is affecting the way I am tokenizing.
