I have trained a topic model using Top2Vec as follows:

import pandas as pd
from top2vec import Top2Vec
data = [['1', 'Beautiful hotel, really enjoyed my stay'], ['2', 'We had a terrible experience. Will not return.'], ['3', 'Lovely hotel. The noise at night, we however did not appreciate']]

df = pd.DataFrame(data, columns=['reviewID', 'Review'])
docs = df.Review.tolist()
ids = df.reviewID.tolist()

model = Top2Vec(docs, speed = 'deep-learn', workers = 14, document_ids = ids)

Now I would like to add the topic(s) assigned to each review back to the original df for further analysis.

I can retrieve the documents by topic as follows:

documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=45, num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

However, I get stuck when trying to retrieve all reviews together with their assigned topics so that I can add them to the original df.

Thank you for your help:)

1 Answer

The following is one way to find each document's topics and add them to the DataFrame as columns:

# Get topic sizes and topic numbers
topic_sizes, topic_nums = model.get_topic_sizes()
# Index the copy by reviewID so that .loc can match the returned document_ids
topic_doc = df.set_index('reviewID')
for t in topic_nums:
    documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=t, num_docs=topic_sizes[t])
    topic_doc.loc[document_ids, t] = 1  # or document_scores if you want to add similarity scores of topics to documents
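
If you would rather have a single topic column than one indicator column per topic, the per-topic results from the loop above can be inverted into a per-document lookup. This is only a rough sketch in plain Python: `invert_topic_results` and the `per_topic` container are hypothetical names, and the scores are made-up stand-ins rather than real Top2Vec output.

```python
def invert_topic_results(per_topic):
    """Map each document ID to its (topic_num, score) pair.

    per_topic: dict of topic_num -> (document_scores, document_ids),
    a hypothetical container for what the loop above retrieves per topic.
    If an ID shows up under several topics, the highest score wins.
    """
    assignment = {}
    for topic_num, (scores, doc_ids) in per_topic.items():
        for score, doc_id in zip(scores, doc_ids):
            best = assignment.get(doc_id)
            if best is None or score > best[1]:
                assignment[doc_id] = (topic_num, float(score))
    return assignment


# Made-up stand-in values, not real Top2Vec output.
per_topic = {
    0: ([0.91, 0.74], ["1", "3"]),
    1: ([0.88], ["2"]),
}
assignment = invert_topic_results(per_topic)

# Attach the topic and score to each review record by its ID.
reviews = [{"reviewID": "1"}, {"reviewID": "2"}, {"reviewID": "3"}]
for row in reviews:
    row["topic"], row["score"] = assignment[row["reviewID"]]
```

With a DataFrame, the same lookup could be applied with something like df["reviewID"].map(lambda i: assignment[i][0]).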

Update: another way to add the top topic of each document is to use model.doc_top:

df["topics"] = model.doc_top
# or use model.get_documents_topics to assign multiple topics (say, 2 per document) to each document:
topics, topic_scores, topic_words, word_scores = model.get_documents_topics(doc_ids=ids, num_topics=2)
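
When asking for several topics per document, the results come back as one row per document, so they need to be spread into columns before they can go back into df. Below is a minimal sketch, assuming the topic numbers and scores each hold num_topics entries per document in the same order as the IDs that were passed in. `topic_columns` is a hypothetical helper and the values are made up for illustration.

```python
def topic_columns(doc_ids, topic_nums, topic_scores):
    """Build one flat record per document: topic_1, score_1, topic_2, ...

    Assumes topic_nums and topic_scores each contain one row per
    document, aligned with doc_ids.
    """
    rows = []
    for doc_id, nums, scores in zip(doc_ids, topic_nums, topic_scores):
        row = {"reviewID": doc_id}
        for rank, (num, score) in enumerate(zip(nums, scores), start=1):
            row[f"topic_{rank}"] = int(num)
            row[f"score_{rank}"] = float(score)
        rows.append(row)
    return rows


# Made-up stand-ins for the arrays returned with num_topics=2.
ids = ["1", "2", "3"]
topics = [[0, 1], [1, 0], [0, 1]]
topic_scores = [[0.91, 0.32], [0.88, 0.21], [0.74, 0.40]]

rows = topic_columns(ids, topics, topic_scores)
```

The records can then be turned into a frame with pd.DataFrame(rows) and joined to the original with df.merge(..., on="reviewID"), which avoids relying on row order.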
Sam S.