
I ran BERTopic to get topics for 3,500 documents. How can I get the topic-probability matrix for each document and export it to CSV? When exporting, I also want to include an identifier for each document.

I tried two approaches. First, I found that topic_model.visualize_distribution(probs[#]) shows the information I want, but how can I export the topic-probability data for each document to CSV?
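
For example, something like this (a sketch; visualize_distribution returns a Plotly figure, so it can be saved as HTML, though that is not the CSV export I am after):

# topic probability distribution for a single document (here, document 0)
fig = topic_model.visualize_distribution(probs[0])
fig.write_html('doc0_topic_distribution.html')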

Second, I found that this thread (How to get all docoments per topic in bertopic modeling) could be useful if I could add a column of probabilities to the data frame it generates. Is there any way to do that?

Please share any other approaches that can produce and export the topic-probability matrix for all documents.

For your information, this is my BERTopic code. Thank you!

from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# embedding, dimensionality-reduction, and clustering components
embedding_model = SentenceTransformer('all-mpnet-base-v2')
umap_model = UMAP(n_neighbors=15)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=1,
                        gen_min_span_tree=True,
                        prediction_data=True)

# English stop words plus a few corpus-specific tokens
stopwords = list(stopwords.words('english')) + ['http', 'https', 'amp', 'com']
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stopwords)

model1 = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    language='english',
    calculate_probabilities=True,
    verbose=True
)
# data: the list of 3,500 documents to model
topics, probs = model1.fit_transform(data)
JJD

1 Answer


The probs variable contains the topic probabilities for each individual document, with one row per document and one column per topic. You can create a DataFrame from those values like so:

import pandas as pd

# convert the 2D probs array (documents x topics) to a DataFrame
topic_prob_df = pd.DataFrame(probs)
# add a 'data' column - or, alternatively, an identifier column for each document
topic_prob_df['data'] = data
# export as csv
topic_prob_df.to_csv('topic-probs.csv')
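
If you also want to know which topic each column refers to, you can name the columns after the topic numbers. Below is a sketch, assuming the usual BERTopic behaviour that the columns of probs follow topic IDs 0, 1, 2, ... in order and that the -1 outlier topic has no column; the 'doc_id' identifier column is only a hypothetical placeholder:

# same DataFrame, but with each probability column named after its topic
topic_prob_df = pd.DataFrame(probs, columns=[f'topic_{i}' for i in range(probs.shape[1])])
# hypothetical identifier column; replace with whatever identifies your documents
topic_prob_df['doc_id'] = list(range(len(data)))
topic_prob_df['data'] = data
topic_prob_df.to_csv('topic-probs.csv', index=False)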
A.T.B
  • Thank you! This way, I got three columns showing probabilities and one column for 'data'. But how do I know which topic each column of probability values is associated with? I have 56 different topics and don't know which topics those three columns refer to. – JJD Sep 19 '22 at 17:57
  • According to the source code, setting calculate_probabilities to True should return the probabilities of all topics across all documents. Can you share the shape of your 'probs' array as returned by the fit_transform method? – A.T.B Sep 19 '22 at 18:24
  • Yes, my 'probs' array is: array([[2.37977792e-002, 5.68686253e-002, 9.19333595e-001], [2.13985734e-309, 1.76208444e-309, 1.00000000e+000], [1.25879747e-309, 2.29313720e-309, 1.00000000e+000], ..., [9.23867916e-003, 1.72322197e-002, 9.73529101e-001], [6.63774096e-003, 1.45281342e-002, 9.78834125e-001], [1.60348529e-002, 4.77972587e-002, 9.36167888e-001]]) – JJD Sep 19 '22 at 19:51
  • And the output for model1.get_topics() shows you 56 different topics? – A.T.B Sep 19 '22 at 20:03
  • Yes, and now I have 53 clusters after running the program again, and now it works! I must have done something wrong before. Thanks a lot! But it gives probabilities for only 52 clusters (52 columns per document), not 53. Is this because the program drops cluster -1, since the -1 group is not meaningful? – JJD Sep 19 '22 at 20:42
  • Good to know it's working! The number of columns should be exactly equal to the number of topics, as they are created from the probs array, which contains all topic probabilities. So either it's a counting error (keeping in mind that indexing in Python starts at 0), or, as you mentioned, it's the -1 topic, which doesn't actually count as a topic. – A.T.B Sep 20 '22 at 09:53
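
For reference, a quick way to check that correspondence (a sketch, assuming get_topics() includes the -1 outlier topic, as it normally does):

# compare the number of probability columns with the number of topics excluding -1
all_topics = model1.get_topics()
n_real_topics = len(all_topics) - (1 if -1 in all_topics else 0)
print(probs.shape[1], n_real_topics)  # these two numbers should match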