
I need an explanation of what get_documents_topics(doc_ids, reduced=False, num_topics=1) does and what its return values mean.

Get document topics. The topic of each document will be returned. The corresponding original topics are returned unless reduced=True, in which case the reduced topics will be returned.

Returns:

  • topic_nums (array of int, shape(len(doc_ids), num_topics)) – The topic number(s) of the document corresponding to each doc_id.
  • topic_score (array of float, shape(len(doc_ids), num_topics)) – Semantic similarity of document to topic(s). The cosine similarity of the document and topic vector.
  • topics_words (array of shape(len(doc_ids), num_topics, 50)) – For each topic the top 50 words are returned, in order of semantic similarity to topic.
  • word_scores (array of shape(num_topics, 50)) – For each topic the cosine similarity scores of the top 50 words to the topic are returned.
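To make my reading of the documentation concrete, here is a minimal NumPy sketch of how I understand topic_nums and topic_score. It is not the library's actual implementation: the function nearest_topics and the toy vectors are made up for illustration, and I am assuming each document is simply matched to the topic vectors with the highest cosine similarity.

```python
import numpy as np

def l2_normalize(vectors):
    # Scale each row to unit length so a dot product equals cosine similarity.
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def nearest_topics(doc_vectors, topic_vectors, num_topics=1):
    # Hypothetical helper, NOT the library's code.
    # sims has shape (n_docs, n_topics): cosine similarity of every document to every topic.
    sims = l2_normalize(doc_vectors) @ l2_normalize(topic_vectors).T
    # Indices (topic IDs) of the num_topics most similar topics per document.
    topic_nums = np.argsort(-sims, axis=1)[:, :num_topics]
    # The matching similarity scores, in the same order.
    topic_score = np.take_along_axis(sims, topic_nums, axis=1)
    return topic_nums, topic_score

# Toy data (assumed): 2 documents and 3 topics in a 2-dimensional embedding space.
doc_vectors = np.array([[1.0, 0.0], [0.0, 2.0]])
topic_vectors = np.array([[0.0, 1.0], [3.0, 0.0], [1.0, 1.0]])

topic_nums, topic_score = nearest_topics(doc_vectors, topic_vectors, num_topics=2)
print(topic_nums.shape)  # (len(doc_ids), num_topics)
print(topic_nums)
print(topic_score)
```

If this reading is right, each entry of topic_nums would be a topic ID (an index into the topic vectors), not a count of topics.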

I am using the news text from the BBC News Classification dataset.

document_id = 1
document = train_df.iloc[document_id]['Text']
document
---
german business confidence slides german business confidence fell in february knocking hopes of a speedy recovery in europe s largest economy....
topic_nums, topic_score, topics_words, word_scores = \
    model.get_documents_topics([document_id], reduced=False)

print(f"topic_nums:{topic_nums}, topic_score: {topic_score}")
for word, score in zip(topics_words[0][:10], word_scores[0][:10]):
    print(f"{word:20}: {score}")
-----
topic_nums:[0], topic_score: [0.3969033]
parliament          : 0.10377583652734756
politicians         : 0.10281675308942795
britain             : 0.10191775858402252
election            : 0.09515437483787537
elections           : 0.0923602283000946
no                  : 0.08872390538454056
non                 : 0.0843275785446167
voters              : 0.08393856137990952
british             : 0.08337553590536118
bbc                 : 0.08136938512325287

What is topic_nums? Is it the ID of a topic, or the number of topics related to the document (document_id = 1)?

I believe the topic of a document is represented by a topic vector, which is the mean of a cluster of document vectors, but please correct me if that is wrong.
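As a sketch of that assumption (not confirmed library behaviour), this is what I mean by a topic vector being the mean of a document vector cluster. The vectors and the cluster_labels array are made up for illustration.

```python
import numpy as np

# Toy document embeddings (assumed): 4 documents in a 2-dimensional space.
doc_vectors = np.array([
    [1.0, 0.0],
    [0.8, 0.2],
    [0.0, 1.0],
    [0.2, 0.8],
])

# Hypothetical cluster assignment for each document, e.g. from a clustering step.
cluster_labels = np.array([0, 0, 1, 1])

# One topic vector per cluster: the mean of the document vectors in that cluster.
topic_vectors = np.stack([
    doc_vectors[cluster_labels == label].mean(axis=0)
    for label in np.unique(cluster_labels)
])

print(topic_vectors)
```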

