Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, while "cat" and "meow" will appear more often in documents about cats (source: Wikipedia).

Generative models (i.e., the statistical models used for topic modeling)

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
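
The generative story behind LDA, the first model listed above, can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two topics, their word probabilities, the `alpha` value, and the helper names `sample_dirichlet` and `generate_document` are all invented here and do not come from any library.

```python
import random

# Hypothetical topic-word distributions, invented for illustration only.
TOPICS = {
    "dogs": {"dog": 0.5, "bone": 0.3, "walk": 0.2},
    "cats": {"cat": 0.5, "meow": 0.3, "nap": 0.2},
}


def sample_dirichlet(alpha, k, rng):
    """Draw from a symmetric Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Generate one document by following the LDA generative process."""
    names = list(TOPICS)
    # Per-document topic mixture (theta in the usual LDA notation).
    theta = sample_dirichlet(alpha, len(names), rng)
    words = []
    for _ in range(n_words):
        # First pick a topic for this word position, then a word from that topic.
        topic = rng.choices(names, weights=theta)[0]
        vocab = list(TOPICS[topic])
        weights = list(TOPICS[topic].values())
        words.append(rng.choices(vocab, weights=weights)[0])
    return theta, words


rng = random.Random(0)  # fixed seed so repeated runs give the same document
theta, words = generate_document(n_words=10, alpha=0.5, rng=rng)
print(dict(zip(TOPICS, theta)))
print(" ".join(words))
```

Fitting a topic model runs this process in reverse: given only the words, it estimates the topic-word distributions and each document's topic mixture.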

980 questions
6 votes, 1 answer

Gensim Dictionary Implementation

I was just curious about the gensim dictionary implementation. I have the following code: def build_dictionary(documents): dictionary = corpora.Dictionary(documents) dictionary.save('/tmp/deerwester.dict') # store the dictionary …
dmil
6 votes, 1 answer

Representation and a good similarity measure between Tweets for topic detection

I'm planning to write a tool for topic detection on Twitter. I've been thinking about a good similarity measure (distance) between two tweets, and how to represent them, taking into account: The #hashtags (I think hashtags are very important when…
6 votes, 2 answers

Latent Dirichlet Allocation Solution Example

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory and based on this blog post http://goo.gl/ccPvE I was able to develop the intuition behind LDA. However I still haven't…
user737128
5 votes, 1 answer

Cast topic modeling outcome to dataframe

I have used BERTopic with KeyBERT to extract some topics from some docs: from bertopic import BERTopic topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True,…
xavi
5 votes, 1 answer

How to extract text from a two-column PDF using PDFPlumber

I am working on topic modeling tasks using Python and I would like to extract text from annual/sustainability reports. However, when I try to extract a report, the extracted lines are broken between two different columns in a page…
5 votes, 1 answer

How can I replace emojis with text and treat them as single words?

I have to do topic modeling in R on pieces of text containing emojis. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results. A red heart emoji is translated as "red heart…
TR_IBK21
5 votes, 1 answer

Gensim module attribute error for wrappers

I want to find the optimal number of topics for LDA. To do this, I used Gensim as follows: def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3): coherence_values = [] model_list = [] for num_topics in…
5 votes, 4 answers

Topic modeling on short texts Python

I want to do topic modeling on short texts. I did some research on LDA and found that it doesn't work well with short texts. What methods would be better, and do they have Python implementations?
Sri Test
5 votes, 1 answer

What is the difference between LDA and NTM in Amazon SageMaker for Topic Modeling?

I am looking for the difference between LDA and NTM. What are some use cases where you would use LDA over NTM? As per the AWS docs: LDA: The Amazon SageMaker Latent Dirichlet Allocation (LDA) algorithm is an unsupervised learning algorithm that attempts to…
Saurabh
5 votes, 0 answers

Perplexity increases with number of topics

There are quite a few posts about this specific issue, but I was unable to solve this problem. I have been experimenting with LDA on the 20 Newsgroups corpus with both the sklearn and Gensim implementations. It is described in the literature that…
Bas
5 votes, 2 answers

Probabilities returned by gensim's get_document_topics method don't add up to one

Sometimes it returns probabilities for all topics and all is fine, but sometimes it returns probabilities for just a few topics and they don't add up to one; it seems to depend on the document. Generally when it returns few topics, the…
nestor556
5 votes, 0 answers

Firebase unsubscribe from topic does not work

I subscribed to a topic in FCM with two Android devices, and after I unsubscribed one device from the topic, I could still send messages with the device that I unsubscribed.
5 votes, 4 answers

pyLDAvis: unable to view the graph

I am trying to visually depict my topics in Python using pyLDAvis. However, I am unable to view the graph. Do we have to view the graph in the browser, or will it pop up on execution? Below is my code: import pyLDAvis import…
Deepa Huddar
5 votes, 2 answers

Using Topic Model, how should we set up a "stop words" list?

There are some standard stop lists, giving words like "a the of not" to be removed from a corpus. However, I'm wondering, should the stop list change case by case? For example, I have 10K articles from a journal, then because of the structure of an…
Ruby
5 votes, 2 answers

How to parallelize topicmodels R package

I have a series of documents (~50,000) that I've transformed into a corpus, and I have been building LDA objects using the topicmodels package in R. Unfortunately, testing more than 150 topics takes several hours. So far, I've found that…
Optimus