Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, and "cat" and "meow" will appear more often in documents about cats (source: Wikipedia).

Generative models (i.e. the statistical models used for topic modelling)

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
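
As a rough illustration of what "generative" means for LDA, the sketch below draws a tiny synthetic corpus: each topic gets a word distribution, each document gets a topic mixture, and every word is produced by first picking a topic and then picking a word from that topic. The function names, the toy vocabulary, and the symmetric priors `alpha` and `beta` are assumptions chosen for illustration, not part of any library.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, num_topics, num_docs, doc_length,
                    alpha=0.5, beta=0.5, seed=0):
    """Generate documents following the LDA generative story (hypothetical helper)."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from Dirichlet(beta).
    topic_word = [sample_dirichlet([beta] * len(vocab), rng)
                  for _ in range(num_topics)]
    docs = []
    for _ in range(num_docs):
        # One topic mixture per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_length):
            topic = sample_categorical(theta, rng)             # pick a topic
            word = sample_categorical(topic_word[topic], rng)  # pick a word from it
            words.append(vocab[word])
        docs.append(words)
    return docs, topic_word
```

Calling `generate_corpus(["dog", "bone", "cat", "meow"], num_topics=2, num_docs=3, doc_length=5)` returns three five-word documents plus the sampled topic-word distributions. Fitting a topic model amounts to running this story in reverse: inferring the hidden distributions from the observed words.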

980 questions
5 votes, 2 answers

What do the parameters of the CsvIterator mean in MALLET?

I am using the MALLET topic modelling sample code, and though it runs fine, I would like to know what the parameters of this statement actually mean: instances.addThruPipe(new CsvIterator(new FileReader(dataFile), …
London guy
5 votes, 3 answers

Topic Modeling tool for large data set (30GB)

I'm looking for a topic modeling tool that can be applied to a large data set. My current training data set is 30 GB. I tried MALLET topic modeling, but I always got an OutOfMemoryError. If you have any tips, please let me know.
Benben
5 votes, 1 answer

Incremental training of Topic Models in MALLET

According to the MALLET documentation, it's possible to train topic models incrementally: "-output-model [FILENAME] This option specifies a file to write a serialized MALLET topic trainer object. This type of output is appropriate for pausing…
vpekar
5 votes, 2 answers

Run cvb in mahout 0.8

The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version for topic modeling and has removed the Latent Dirichlet Allocation (lda) approach, because cvb can be parallelized much better. Unfortunately there is only documentation…
JoKnopp
5 votes, 1 answer

Implementing Topic Model with Python (numpy)

Recently, I implemented Gibbs sampling for an LDA topic model in Python using numpy, taking some code from a site as a reference. In each iteration of Gibbs sampling, we remove one (current) word, sample a new topic for that word according to a…
D T
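
The step described in the excerpt above (remove the current word, then resample its topic) is the core of collapsed Gibbs sampling for LDA. A minimal sketch, assuming symmetric priors `alpha` and `beta` and documents given as lists of integer word ids, might look like the following. It uses plain Python lists rather than numpy to stay dependency-free, the function name is hypothetical, and it is not the asker's code.

```python
import random


def gibbs_lda(docs, vocab_size, num_topics, iterations=50,
              alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d with topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # times word w has topic k
    topic_total = [0] * num_topics                              # words assigned to topic k
    assignments = []

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the current word from all counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                # Add the word back under its newly sampled topic.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return assignments, doc_topic, topic_word
```

The count tables stay consistent because every decrement is matched by an increment within the same step; a numpy version would typically replace the list comprehension with one vectorized expression over topics.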
5 votes, 2 answers

Removing an "empty" character item from a corpus of documents in R?

I am using the tm and lda packages in R to topic model a corpus of news articles. However, I am getting a "non-character" problem represented as "" that is messing up my topics. Here is my workflow: text <- Corpus(VectorSource(d$text)) newtext <-…
user836015
5 votes, 1 answer

Topic Modeling: How do I use my fitted LDA model to predict new topics for a new dataset in R?

I am using the 'lda' package in R for topic modeling. I want to predict new topics (collections of related words in a document) for a new dataset using a fitted Latent Dirichlet Allocation (LDA) model. In the process, I came across predictive.distribution()…
ankit sethi
4 votes, 0 answers

Gensim HDP - Top Topics' distribution for document

I want the topic distribution for my documents. However, Gensim's HDP show_topic() returns 20 topics by default, and I assume they are not necessarily the best ones. After digging deeper, I found out there are 150 topics in total, as the truncation level…
Shirish Bajpai
4 votes, 1 answer

(gensim) LdaMallet vs LdaModel?

What is the difference between using gensim.models.LdaMallet and gensim.models.LdaModel? I noticed that the parameters are not all the same and would like to know when one should be used over the other.
Desi Pilla
4 votes, 3 answers

A practical example of GSDMM in python?

I want to use GSDMM to assign topics to some tweets in my data set. The only examples I found (1 and 2) are not detailed enough. I was wondering if you know of a source (or care enough to make a small example) that shows how GSDMM is implemented…
Pie-ton
4 votes, 4 answers

Coherence score (u_mass) -18 is good or bad?

I read this question (Coherence score 0.4 is good or bad?) and found that the coherence score (u_mass) ranges from -14 to 14. But when I ran my experiments, I got a score of -18 for u_mass and 0.67 for c_v. I wonder how my u_mass score is out of range…
Dammio
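
One possible reason a u_mass score can fall outside a quoted range is that the measure is built from log ratios of document counts and has no fixed lower bound. The sketch below assumes the common form log((D(w_i, w_j) + eps) / D(w_j)), averaged over ordered pairs of a topic's top words, where D counts the documents containing the given words. The function name, the `eps` default, and the averaging are assumptions for illustration; real libraries may differ in these details.

```python
import math
from itertools import combinations


def u_mass(top_words, documents, eps=1e-12):
    """Average UMass-style coherence for one topic (illustrative sketch).

    top_words is assumed to be ordered from most to least probable;
    documents is a list of token lists.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    scores = []
    # combinations yields index pairs (j, i) with j < i, so w_j is the higher-ranked word.
    for j, i in combinations(range(len(top_words)), 2):
        w_j, w_i = top_words[j], top_words[i]
        d_j = doc_freq(w_j)
        if d_j == 0:
            continue  # skip words that never occur in the reference corpus
        scores.append(math.log((doc_freq(w_i, w_j) + eps) / d_j))
    return sum(scores) / len(scores) if scores else float("nan")
```

A pair of top words that never co-occur contributes log(eps / D(w_j)), so a very small `eps` can pull the average far below zero. Under this formulation the score depends on the smoothing constant and the corpus, which is why comparisons are usually made between models on the same corpus rather than against an absolute range.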
4 votes, 1 answer

After applying gensim LDA topic modeling, how to get documents with highest probability for each topic and save them in a csv file?

I have used gensim LDA Topic Modeling to get associated topics from a corpus. Now I want to get the top 20 documents representing each topic: documents that have the highest probability in a topic. And I want to save them in a CSV file with this…
Aria
4 votes, 1 answer

pyspark LDA get words in topics

I am trying to run LDA. I am not applying it to words and documents, but to error messages and error causes. Each row is an error and each column is an error cause. A cell is 1 if the error cause was active and 0 if it was not. Now I am…
LN_P
4 votes, 1 answer

pyLDAvis | Could I get "Top-30 Most Relevant Terms for Topic"?

While visualizing a topic model with LDAvis, I found that the terms shown vary depending on the topic and on the lambda value set with "Slide to adjust relevance metric". Is there a way to get this word list? I want to get the representative words that vary depending on…
4 votes, 1 answer

sklearn LatentDirichletAllocation topic inference on new corpus

I have been using the sklearn.decomposition.LatentDirichletAllocation module to explore a corpus of documents. After a number of iterations of training and adjusting the model (i.e. adding stopwords and synonyms, varying the number of topics), I am…