Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats (source: Wikipedia).
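The intuition above can be sketched with a toy example. The two mini-documents and their topic labels are invented for illustration; the snippet only compares relative word frequencies per label, which is the kind of signal a topic model tries to recover without being given the labels.

```python
from collections import Counter

# Hypothetical toy corpus: (topic label, text). The labels are supplied here
# only to illustrate the intuition; a real topic model has to infer them.
docs = [
    ("dogs", "the dog chewed the bone and the dog barked"),
    ("cats", "the cat said meow and the cat slept"),
]

def word_frequencies(text):
    """Return the relative frequency of each word in a text."""
    words = text.split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

freqs = {label: word_frequencies(text) for label, text in docs}

# Show the three most frequent words under each label.
for label, f in freqs.items():
    top = sorted(f, key=f.get, reverse=True)[:3]
    print(label, top)
```

In practice one would also remove stop words such as "the" before comparing frequencies, since they dominate every topic equally.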

Generative models (i.e. the statistical models used for topic modelling)

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
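As a rough illustration of what "generative" means for LDA, the sketch below samples one document: draw a topic mixture from a Dirichlet distribution, then for each word position draw a topic and then a word from that topic. The vocabulary, the topic-word probabilities, and the helper names (`sample_dirichlet`, `generate_document`) are assumptions made up for this example, and it uses only the Python standard library. It does not cover HDP.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. Draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Draw a topic index for this word position from theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Draw a word from the chosen topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

rng = random.Random(0)
vocab = ["dog", "bone", "cat", "meow"]
# Hypothetical topic-word distributions; each row sums to 1.
topic_word = [
    [0.45, 0.45, 0.05, 0.05],  # a "dogs" topic
    [0.05, 0.05, 0.45, 0.45],  # a "cats" topic
]
theta, doc = generate_document(topic_word, vocab, alpha=[0.5, 0.5], length=10, rng=rng)
print(theta, doc)
```

Fitting a topic model runs this story in reverse: given only the documents, it estimates the topic-word distributions and the per-document mixtures.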

980 questions
7 votes, 1 answer

Using LDA (topic model): the distributions of each topic over words are similar and "flat"

Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a collection of documents. I'm using the Python gensim package and have two problems: I printed out the most frequent words for each topic (I tried 10, 20, 50…
Ruby
7 votes, 1 answer

How can I speed up a topic model in R?

Background: I am trying to fit a topic model with the following data and specification: documents = 140,000, words = 3,000, and topics = 15. I am using the package topicmodels in R (3.1.2) on a Windows 7 machine (24 GB RAM, 8 cores). My problem is that…
7 votes, 5 answers

Mallet topic model example cannot compile

I want to compile Mallet in my Java project (instead of using the command line), so I included the jar in my project and used the example code from http://mallet.cs.umass.edu/topics-devel.php. However, when I run this code, there is an error that…
flyingmouse
7 votes, 3 answers

Text Clustering and topic extraction

I'm doing some text mining using the excellent scikit-learn module. I'm trying to cluster and classify scientific abstracts. I'm looking for a way to cluster my set of tf-idf representations, without having to specify the number of clusters in…
Misconstruction
7 votes, 2 answers

Topic modelling, but with known topics?

Okay, so usually topic models (such as LDA, pLSI, etc.) are used to infer topics that may be present in a set of documents, in an unsupervised fashion. I would like to know if anyone has any ideas as to how I can shoehorn my problem into an LDA…
user1871183
6 votes, 1 answer

GSDMM Convergence of Clusters (Short Text Clustering)

I am using this GSDMM Python implementation to cluster a dataset of text messages. GSDMM converges fast (around 5 iterations) according to the initial paper. I also have a convergence to a certain number of clusters, but there are still a lot of…
simon
6 votes, 1 answer

ValueError: Stop argument for islice() must be None or an integer: 0 <= x <= sys.maxsize on topic coherence

I'm following this tutorial, https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0, and ran into a problem. My goal with this code is to iterate over the range of topics, alpha, and beta…
6 votes, 1 answer

Topic Modeling in Mallet; Documentation

I'm looking for some good documentation for Mallet, specifically for its classes related to topic modeling. I've looked at the Java docs but they aren't too helpful. For example: estimate public void estimate() throws…
akobre01
6 votes, 2 answers

Negative Values: Evaluate Gensim LDA with Topic Coherence

I'm currently trying to evaluate my topic models with gensim topiccoherencemodel: from gensim.models.coherencemodel import CoherenceModel cm_u_mass = CoherenceModel(model = model1, corpus = corpus1, coherence = 'u_mass') coherence_u_mass =…
Nils_Denter
6 votes, 1 answer

Stem completion in R replaces names, not data

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the Quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that I'm not counting variations on the…
J. Trimarco
6 votes, 1 answer

How to interpret the sklearn LDA perplexity score: why does it always increase as the number of topics increases?

I am trying to find the optimal number of topics using sklearn's LDA model. To do this, I calculate perplexity by following the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2. But when I increase the number of topics, perplexity always increases…
JonghoKim
6 votes, 1 answer

Automatic labeling of LDA generated topics

I'm trying to categorize customer feedback and I ran an LDA in python and got the following output for 10 topics: (0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" +…
Arman
6 votes, 1 answer

Error installing topicmodels in R on Ubuntu

I am getting an error while installing the topicmodels package in R. On running install.packages("topicmodels",dependencies=TRUE), the following are the last few lines I get. Please help. My R version is 3.1.3. g++ -I/usr/share/R/include -DNDEBUG …
Mohit Mangal
6 votes, 3 answers

Are there any efficient python libraries for Dynamic Topic Models, preferably extending Gensim?

I'm trying to model Twitter stream data with topic models. Gensim, being an easy-to-use solution, is impressive in its simplicity. It has a truly online implementation for LSI, but not for LDA. For a changing content stream like Twitter, Dynamic…
Ravi Karan
6 votes, 1 answer

Hierarchical classification + topic model training data for internet articles and social media

I want to classify large numbers (100K to 1M+) of smallish internet-based articles (tweets, blog articles, news, etc) by topic. Toward this goal, I have been looking for labeled training data documents which I could use to build classifier…