I have a set of trigrams (see pickle file). Each column name is a trigram, each row represents a document, and each cell is a binary indicator of whether that trigram occurs in that document.
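To make that layout concrete, here is a toy stand-in for the pickled table. It is only a sketch: the trigrams and values are made up, a plain dict replaces the real DataFrame, and the stringified-tuple column names are an assumption based on the `replace` calls in the code further down.

```python
# Toy stand-in for the pickled table (hypothetical data, not the real file).
# Keys are column names (trigrams); each list holds one binary entry per document.
toy_trigrams = {
    "('topic', 'model', 'training')": [1, 0, 1],
    "('word', 'cloud', 'plot')": [0, 1, 1],
}


def trigrams_per_document(table):
    """Return, for each document (row), the trigram columns that occur in it."""
    n_docs = len(next(iter(table.values())))
    return [
        [col for col, cells in table.items() if cells[row]]
        for row in range(n_docs)
    ]


docs = trigrams_per_document(toy_trigrams)
print(docs)
```

Read this way, one document corresponds to one row, so a per-document token list would contain every trigram whose cell is 1 in that row.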

I then preprocess the trigrams and train an LDA model using the code below. However, being new to LDA Mallet, I am doing something wrong: the "words" shown in the word cloud are just numbers. I cannot figure out where the link between the words and their numeric representation is lost, or how to recover it.

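import pickle

import gensim
import matplotlib.pyplot as plt
from gensim.models.wrappers import LdaMallet  # available in gensim < 4.0
from wordcloud import WordCloud

with open('small_trigrams.pkl', 'rb') as file:
    small_trigrams = pickle.load(file)

small_mydict = gensim.corpora.Dictionary()
small_trigrams_collection = []

for col in small_trigrams.columns:
    # Strip the tuple punctuation from the column name and split it into its three words
    trigram = col.replace("(", "").replace("'", "").replace(" ", "").replace(")", "").strip().split(",", 3)
    value = small_trigrams[col].sum()  # trigram occurrences
    # Add the trigram once per occurrence, each time as its own "document"
    for i in range(int(value)):
        small_trigrams_collection.append(trigram)
small_mycorp = [small_mydict.doc2bow(trigram, allow_update=True) for trigram in small_trigrams_collection]  # create corpus

# Train LDA on the trigrams features, assess topic coherence
small_topics_coherence = {}  # dict with topics: coherence score
small_models = {}  # collection of models

# train LDA on trigrams features
# path_to_mallet_binary: path to the local MALLET executable (defined elsewhere)
# i: number of topics (as written, still the counter left over from the loop above)
model = LdaMallet(path_to_mallet_binary, corpus=small_mycorp, num_topics=i, id2word=small_mydict)  # train model

# Plot a word cloud for each of the first six topics
for t in range(min(model.num_topics, 6)):
    plt.figure()
    plt.imshow(WordCloud().fit_words(dict(model.show_topic(t, 200))))
    plt.axis("off")
    plt.title("Topic #" + str(t))
    plt.show()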
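A minimal sketch of one possible check, using only the standard library. It replays the column-name cleaning on two made-up column names to show what tokens it yields. The example names, and the helpers `column_to_tokens`, `column_to_single_token`, and `looks_numeric`, are hypothetical. Whether the real column names contain words or integer IDs is not known here.

```python
import ast


def column_to_tokens(col):
    """Replay the cleaning used above: one column name becomes three separate tokens."""
    return col.replace("(", "").replace("'", "").replace(" ", "").replace(")", "").strip().split(",", 3)


def column_to_single_token(col):
    """Alternative: parse the stringified tuple and keep the trigram as one token."""
    words = ast.literal_eval(col)
    return "_".join(str(w) for w in words)


def looks_numeric(token):
    """True if the token consists of digits only."""
    return token.isdigit()


# Hypothetical column names: one holding words, one holding integer IDs.
word_col = "('lda', 'topic', 'model')"
id_col = "(12, 7, 301)"

for col in (word_col, id_col):
    tokens = column_to_tokens(col)
    print(col, tokens, all(looks_numeric(t) for t in tokens), column_to_single_token(col))
```

If the real column names resemble the second example, the dictionary would only ever see digit strings. In that case the numbers in the word cloud would come from the input data rather than from the model.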

Can someone point me to my mistake?

user456789