
I calculated a topic model, so far so good.

First of all my dataframe looks like this:

identifier     comment_cleaned
1              some cleaned comment
2              another cleaned comment
8              
...            ...

Then I calculated my models like this:

import joblib  # needed for joblib.dump / joblib.load below
import lda
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def remove_allzerorows(smatrix):
    nonzero_row_indice, _ = smatrix.nonzero()
    unique_nonzero_indice = np.unique(nonzero_row_indice)
    return smatrix[unique_nonzero_indice]

univectorizer = CountVectorizer(analyzer = "word", min_df = 0.001, ngram_range = (1,1)) 
unicorpus = univectorizer.fit_transform(df["comment_cleaned"])
unicorpus = remove_allzerorows(unicorpus)
unigrams = univectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2

n_topics = [2, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100, 110, 120]
n_iter = 2000
alpha = 0.1
beta = 0.01

for topics in n_topics:
    print("start with number of topics:", topics)
    lda_model = lda.LDA(
                    n_topics = topics, n_iter = n_iter,
                    alpha = alpha, eta = beta,
                    random_state = 42
                   )
    lda_model.fit(unicorpus)
    joblib.dump(lda_model, f"models/lda_{topics}topics.pkl") 

Afterwards I evaluated the topics and chose the number of topics that represent my dataset the best. This was 80 topics. Now what I would like to do is: Add 80 columns to my dataframe that represent the topic distributions. In the end it would look like this:

identifier     comment_cleaned          topic_1      topic_2     ...
1              some cleaned comment     0.11         0.0         ...
2              another cleaned comment  0.30         0.1         ...
8                                       0.00         0.0         ...
...            ...                      ...          ...         ...

I understand how to create a document-topic matrix, but I do not see how to append it to my initial dataframe:

best_lda_model = joblib.load("models/lda_80topics.pkl")
lda_output = best_lda_model.transform(unicorpus)
df_document_topic = pd.DataFrame(np.round(lda_output, 2))

Any help? Thank you!

cian

1 Answer


If your dataframe is N rows long and you have an N x T matrix M, where T is the number of topics, then to add this matrix to the dataframe all you need to do is generate a list of T strings to use as the new column names, for example:

new_column_names = ["topic_{t}".format(t=t) for t in range(1, M.shape[1] + 1)]  # topic_1 ... topic_T, as in the question

Then you can simply plonk the matrix values into the dataframe like this:

df[new_column_names] = M  # df is the original dataframe from the question

Pandas should realise what you're trying to do and apply the data.
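
As a minimal sketch of that assignment, here is a toy version with three documents and two topics. The values in `df` and `M` are made up for illustration; in the question, `M` would be the output of `best_lda_model.transform(unicorpus)`. It assumes a reasonably recent pandas that accepts a 2-D array when assigning to a list of new columns.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real data (hypothetical values).
df = pd.DataFrame({
    "identifier": [1, 2, 8],
    "comment_cleaned": ["some cleaned comment", "another cleaned comment", "third comment"],
})
M = np.array([[0.11, 0.89],
              [0.30, 0.70],
              [0.50, 0.50]])  # N x T document-topic matrix, N = 3, T = 2

# One column name per topic, numbered from 1 as in the desired output.
new_column_names = [f"topic_{t}" for t in range(1, M.shape[1] + 1)]

# Assign all topic columns at once; M must have exactly len(df) rows.
df[new_column_names] = np.round(M, 2)
print(df)
```

The assignment is purely positional: row i of `M` goes to row i of `df`, so the two must be in the same order.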

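The matrix must have exactly as many rows as the dataframe, in the same order. Note that in the question's code remove_allzerorows drops documents with no remaining terms, so the matrix can end up with fewer rows than df; in that case you need to keep track of which rows were kept and assign the topic values only to those rows.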
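
If rows were dropped before fitting (as `remove_allzerorows` does in the question), one possible way to keep things aligned is to remember the positions of the rows that survived and join on the dataframe index. The sketch below is only an illustration: `nonzero_row_positions` is a hypothetical helper, `counts` and `doc_topic` are made-up stand-ins for the `CountVectorizer` output and the `transform` result, and filling the dropped rows with 0.0 is an assumption based on the desired output shown in the question.

```python
import numpy as np
import pandas as pd

def nonzero_row_positions(matrix):
    """Return the sorted positions of rows with at least one nonzero entry.

    Hypothetical helper; works for NumPy arrays and SciPy sparse matrices,
    since both expose .nonzero().
    """
    row_idx, _ = matrix.nonzero()
    return np.unique(row_idx)

df = pd.DataFrame(
    {"comment_cleaned": ["some cleaned comment", "another cleaned comment", ""]},
    index=pd.Index([1, 2, 8], name="identifier"),
)

# Stand-in for the document-term matrix; the last document is empty.
counts = np.array([[1, 1, 0],
                   [0, 1, 2],
                   [0, 0, 0]])
kept = nonzero_row_positions(counts)  # positions of the rows the model actually saw

# Stand-in for best_lda_model.transform(counts[kept]): one row per kept document.
doc_topic = np.array([[0.11, 0.89],
                      [0.30, 0.70]])

topic_cols = [f"topic_{t}" for t in range(1, doc_topic.shape[1] + 1)]

# Give the topic rows the index labels of the documents they came from,
# then left-join so dropped documents stay in the result.
df_topics = pd.DataFrame(np.round(doc_topic, 2), index=df.index[kept], columns=topic_cols)
df_out = df.join(df_topics).fillna({c: 0.0 for c in topic_cols})
print(df_out)
```

Joining on the index rather than assigning positionally means a length mismatch shows up as missing values instead of an error or silently shifted rows.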

Thomas Kimber