Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, while "cat" and "meow" will appear more often in documents about cats (source: Wikipedia).
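The intuition above can be sketched in a few lines of Python. This is only an illustration: the topic names and word probabilities in `TOPICS` are invented by hand, whereas a real topic model would estimate them from a corpus, and the `floor` value for unseen words is an arbitrary assumption.

```python
import math

# Hypothetical topic-word probabilities, written by hand for illustration.
# A real topic model would learn these from a collection of documents.
TOPICS = {
    "dogs": {"dog": 0.4, "bone": 0.3, "cat": 0.05, "meow": 0.05, "the": 0.2},
    "cats": {"dog": 0.05, "bone": 0.05, "cat": 0.4, "meow": 0.3, "the": 0.2},
}


def log_likelihood(tokens, word_probs, floor=1e-6):
    """Sum of log P(word | topic); words missing from the table get `floor`."""
    return sum(math.log(word_probs.get(token, floor)) for token in tokens)


def most_likely_topic(text):
    """Return the name of the topic under which the text is most probable."""
    tokens = text.lower().split()
    scores = {name: log_likelihood(tokens, probs) for name, probs in TOPICS.items()}
    return max(scores, key=scores.get)


print(most_likely_topic("the dog chewed the bone"))
print(most_likely_topic("the cat said meow"))
```

Note that this only scores a document against fixed topics. Discovering the topics themselves, which is what topic modeling does, requires fitting a model such as LDA.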

Generative models (i.e. the statistical models used for topic modelling)

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
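As a rough sketch of what "generative" means for LDA: each document draws its own topic proportions from a Dirichlet distribution, then each word is produced by first picking a topic and then picking a word from that topic. The vocabulary, the topic-word table, and the `alpha` values below are made up for illustration. In full LDA the topic-word distributions are themselves drawn from a Dirichlet prior and everything is inferred from data rather than fixed by hand. HDP, which also infers the number of topics, is not shown.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocab, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic proportions
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words


# Hypothetical vocabulary and topic-word probabilities (each row sums to 1).
VOCAB = ["dog", "bone", "cat", "meow"]
TOPIC_WORD = [
    [0.5, 0.4, 0.05, 0.05],  # a "dogs" topic
    [0.05, 0.05, 0.5, 0.4],  # a "cats" topic
]

rng = random.Random(0)
theta, doc = generate_document(TOPIC_WORD, VOCAB, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)
print(doc)
```

Fitting a topic model runs this story in reverse: given only the observed words, it infers the topic proportions and topic-word distributions that most plausibly produced them.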

Software / Libraries

Related Tags:

980 questions
4
votes
4 answers

Removing stopwords from a user-defined corpus in R

I have a set of documents: documents = c("She had toast for breakfast", "The coffee this morning was excellent", "For lunch let's all have pancakes", "Later in the day, there will be more talks", "The talks on the first day were great", …
StatsSorceress
4
votes
1 answer

R LDAvis K=2 createJSON() error

I was using the LDAvis package's createJSON() function with a topic model for 2 topics and received this error: Error in stats::cmdscale(dist.mat, k = 2) : 'k' must be in {1, 2, .. n - 1}. Then I tested with the reproducible example given here, by…
anonR
4
votes
1 answer

Gensim LdaMulticore is not multiprocessing properly (using just 4 workers)

I am using Gensim's LdaMulticore to perform LDA. I have around 28M small documents (around 100 characters each). I have set the workers argument to 20, but top shows it using only 4 processes. There are some discussions around it that it might…
Naman
4
votes
1 answer

LDA topic modeling input data

I am new to Python. I just started working on a project to use LDA topic modeling on tweets. I am trying the following code: This example uses an online dataset. I have a CSV file that includes the tweets that I need to use. Can anybody tell…
Robbert
4
votes
3 answers

Gensim LDA - Default number of iterations

I wish to know the default number of iterations in gensim's LDA (Latent Dirichlet Allocation) algorithm. I don't think the documentation talks about this. (Number of iterations is denoted by the parameter iterations while initializing the LdaModel…
Utsav T
4
votes
1 answer

Topic modeling in R: Building topics based on a predefined list of terms

I’ve spent a couple of days working on topic models in R and I’m wondering if I could do the following: I would like R to build topics based on a predefined termlist with specific terms. I already worked with this list to identify ngrams (RWeka) in…
Dobby
4
votes
0 answers

How to get the probability distribution of words for a particular topic?

I am doing topic modelling using Mallet and everything works fine except that I am unable to get the probability distribution of the words in any particular topic. However, I am using the below code to print the topic proportions for any particular…
London guy
4
votes
3 answers

how to add tokens to gensim dictionary

I use gensim to build a dictionary from a collection of documents. Each document is a list of tokens. This is my code: def constructModel(self, docTokens): """ Given document tokens, constructs the tf-idf and similarity models""" #construct…
Athari
4
votes
1 answer

R - LDA Topic Model Output Data

I'm working on building some topic models in R using the 'topicmodels' package. After pre-processing and creating a document term matrix, I am applying the following LDA Gibbs model. This may be a simple answer but I'm a newbie to R so here it goes.…
user3587152
4
votes
1 answer

Error while loading class when using Stanford Topic Modeling Toolkit (TMT)

I have tried JDK7-Update40 and JDK8, but still cannot run the test code from the TMT website. Every time I click 'run', it gives the error messages below: error: error while loading CharSequence, class file 'C:\Program …
Joyce Zhou
4
votes
2 answers

Passing Python strings to Mallet for topic modelling

I'm building a corpus of texts harvested alongside some metadata from HTML with BeautifulSoup. It would be really helpful if I could call Mallet from within Python, and have it model topics from Python strings, rather than from text files in a…
user2437842
4
votes
0 answers

Implementing deep belief network for topic modelling

I'm trying to implement the deep belief network for the Semantic Hashing article (http://www.cs.toronto.edu/~hinton/absps/sh.pdf) by Geoffrey Hinton and Ruslan Salakhutdinov. I have a hard time figuring out how to implement the constrained Poisson…
4
votes
0 answers

How can I infer a new document against Mahout TopicModel output?

Given a topic model from Mahout LDA CVB program/offline batch execution, I would like to infer a new document using the model/online web service calls. These documents are not incredibly helpful for new-ing and infer-ing. *…
4
votes
1 answer

Labelled LDA usage

I am working on a project which requires applying the topic model LDA. Because each document in my case is short, I have to use Labelled LDA. I do not have much knowledge in this area, and all I need to do is to apply the LLDA to my data. After…
lelelulu
4
votes
1 answer

Trying to remove words from a DocumentTermMatrix in order to use topicmodels

So, I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents, which are each ~1000 words). The process runs and then dies, I think because it is running out of memory. So I try to shrink the size of the document…
cforster