I ran BERTopic to get topics for 3,500 documents. How can I get the topic-probability matrix for all documents and export it to CSV? The export should also include each document's identifier.
I tried two approaches. First, I found that topic_model.visualize_distribution(probs[#]) shows the information I want for a single document. But how can I export the topic-probability data for every document to CSV?
Second, I found this thread ("How to get all docoments per topic in bertopic modeling"), which could be useful if I could add a probability column to the data frame it generates. Is there a way to do that?
Please share any other approaches that can produce and export the topic-probability matrix for all documents.
For your information, this is my BERTopic code. Thank you!
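# Imports needed by the snippet below
from bertopic import BERTopic
from hdbscan import HDBSCAN
from nltk.corpus import stopwords
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import CountVectorizer
from umap import UMAP

embedding_model = SentenceTransformer('all-mpnet-base-v2')
umap_model = UMAP(n_neighbors=15)
hdbscan_model = HDBSCAN(min_cluster_size=20, min_samples=1,
                        gen_min_span_tree=True,
                        prediction_data=True)
# Use a separate name so the nltk `stopwords` module is not overwritten
stop_words = list(stopwords.words('english')) + ['http', 'https', 'amp', 'com']
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=stop_words)
model1 = BERTopic(
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
    embedding_model=embedding_model,
    vectorizer_model=vectorizer_model,
    language='english',
    calculate_probabilities=True,
    verbose=True
)
# `data` is the list of 3,500 document strings
topics, probs = model1.fit_transform(data)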
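To make the target concrete, here is a minimal sketch of the CSV layout I am after. It is not a verified BERTopic recipe. It uses only the standard csv module with toy stand-ins for the model output. The names doc_ids and write_topic_probs are hypothetical (mine, not part of BERTopic), and the sketch assumes probs has one row per document and one column per topic, with column k corresponding to topic k.

```python
import csv
import io


def write_topic_probs(doc_ids, topics, probs, out):
    """Write one CSV row per document: id, assigned topic, then one probability per topic.

    Assumption: `probs` is a 2D array-like (list of lists or NumPy array)
    with shape (n_documents, n_topics).
    """
    n_topics = len(probs[0])
    writer = csv.writer(out)
    header = ["doc_id", "assigned_topic"] + [f"topic_{k}" for k in range(n_topics)]
    writer.writerow(header)
    for doc_id, topic, row in zip(doc_ids, topics, probs):
        # float() turns NumPy scalars into plain Python floats for clean output
        writer.writerow([doc_id, topic] + [float(p) for p in row])


# Toy stand-ins for the real outputs of fit_transform (hypothetical values)
doc_ids = ["doc_a", "doc_b", "doc_c"]          # one identifier per document
topics = [0, 1, -1]                            # -1 is the outlier label
probs = [[0.7, 0.2], [0.1, 0.8], [0.3, 0.3]]   # rows: documents, columns: topics

buffer = io.StringIO()
write_topic_probs(doc_ids, topics, probs, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With the real topics and probs, the in-memory buffer would be replaced by a file opened with open('topic_probs.csv', 'w', newline=''), and doc_ids would be my own list of identifiers in the same order as data.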