I need an explanation of what get_documents_topics(doc_ids, reduced=False, num_topics=1) does. The documentation says:
Get document topics. The topic of each document will be returned. The corresponding original topics are returned unless reduced=True, in which case the reduced topics will be returned.
Returns:
- topic_nums (array of int, shape(len(doc_ids), num_topics)) – The topic number(s) of the document corresponding to each doc_id.
- topic_score (array of float, shape(len(doc_ids), num_topics)) – Semantic similarity of document to topic(s). The cosine similarity of the document and topic vector.
- topics_words (array of shape(len(doc_ids), num_topics, 50)) – For each topic the top 50 words are returned, in order of semantic similarity to topic.
- word_scores (array of shape(num_topics, 50)) – For each topic the cosine similarity scores of the top 50 words to the topic are returned.
I am using the news text from the BBC News Classification dataset.
document_id = 1
document = train_df.iloc[document_id]['Text']
document
---
german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy....
topic_nums, topic_score, topics_words, word_scores = \
    model.get_documents_topics([document_id], reduced=False)
print(f"topic_nums:{topic_nums}, topic_score: {topic_score}")
for word, score in zip(topics_words[0][:10], word_scores[0][:10]):
    print(f"{word:20}: {score}")
-----
topic_nums:[0], topic_score: [0.3969033]
parliament : 0.10377583652734756
politicians : 0.10281675308942795
britain : 0.10191775858402252
election : 0.09515437483787537
elections : 0.0923602283000946
no : 0.08872390538454056
non : 0.0843275785446167
voters : 0.08393856137990952
british : 0.08337553590536118
bbc : 0.08136938512325287
What is topic_nums? Is it the ID of a topic, or the number of topics related to the document (document_id = 1)?
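To make the first reading concrete: if topic_nums holds topic IDs, I would expect the method to behave roughly like the sketch below, which ranks topic vectors by cosine similarity to a document vector and returns the indices of the best matches together with their scores. This is only my assumption about the behaviour, not the library's actual code; cosine, rank_topics, and the toy vectors are hypothetical names and data made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_topics(doc_vector, topic_vectors, num_topics=1):
    """Return (topic indices, similarity scores) of the closest topics.

    Hypothetical helper: the index of a topic vector in the list is
    treated as its topic ID.
    """
    scores = [cosine(doc_vector, t) for t in topic_vectors]
    # Sort topic indices from most to least similar to the document.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:num_topics]
    return top, [scores[i] for i in top]


# Toy data: three 2-dimensional "topic vectors" and one "document vector".
topic_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_vector = [0.9, 0.1]

topic_nums, topic_score = rank_topics(doc_vector, topic_vectors, num_topics=2)
print(f"topic_nums:{topic_nums}, topic_score: {topic_score}")
```

Under this reading, topic_nums:[0] in my output above would mean "the document is closest to topic 0", and topic_score would be the cosine similarity between the document and that topic.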
I believe the topic of a document is a topic vector, which is the mean of the document vectors in a cluster, but please correct me if that is wrong.
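In other words, I am assuming something like the following minimal sketch, where a topic vector is the element-wise mean of the document vectors in one cluster. The centroid helper and the toy cluster are hypothetical; I have not confirmed that the library computes topic vectors this way.

```python
def centroid(vectors):
    """Element-wise arithmetic mean of a list of equal-length vectors."""
    n = len(vectors)
    # zip(*vectors) groups the values of each dimension together.
    return [sum(column) / n for column in zip(*vectors)]


# Toy cluster of three 2-dimensional "document vectors".
cluster = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

topic_vector = centroid(cluster)
print(topic_vector)
```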