Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, while "cat" and "meow" will appear more often in documents about cats (source: Wikipedia).

Generative models (i.e. the statistical models used for topic modelling)

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
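
The dog/cat intuition above can be sketched in a few lines of Python. This is only a toy illustration, not LDA or HDP: the topic names, word probabilities, and the `UNSEEN` fallback are hand-picked assumptions, whereas a real topic model would estimate the word distributions from a corpus. The sketch scores a text under each topic's word distribution and picks the topic with the highest log-likelihood.

```python
import math
from collections import Counter

# Hypothetical, hand-picked word probabilities per topic (illustration only;
# a real topic model such as LDA would learn these from the documents).
TOPICS = {
    "dogs": {"dog": 0.4, "bone": 0.3, "walk": 0.2, "meow": 0.05, "cat": 0.05},
    "cats": {"cat": 0.4, "meow": 0.3, "nap": 0.2, "dog": 0.05, "bone": 0.05},
}
UNSEEN = 1e-3  # assumed small probability for words a topic does not list


def log_likelihood(tokens, word_probs):
    """Sum of log-probabilities of the tokens under one topic's word distribution."""
    counts = Counter(tokens)
    return sum(n * math.log(word_probs.get(w, UNSEEN)) for w, n in counts.items())


def most_likely_topic(text):
    """Return the name of the topic under which the text is most probable."""
    tokens = text.lower().split()
    return max(TOPICS, key=lambda name: log_likelihood(tokens, TOPICS[name]))


print(most_likely_topic("the dog chewed a bone"))
print(most_likely_topic("cat meow meow"))
```

Unlike this sketch, LDA treats each document as a mixture of several topics rather than assigning it a single one.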

980 questions
0 votes, 1 answer

How to find the number of documents (and fraction) per topic using LDA?

I am trying to extract topics from 7 million tweets. I treat each tweet as a document, so I stored all tweets in a file where each line (one tweet) is a document. I used this file as the input file for the Mallet API. public static…
Khaled
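
One common way to approach the question above is to assign each document to its highest-probability topic and then count. The following is a minimal Python sketch under that assumption, not the Mallet API: the function name `documents_per_topic` and the proportions in `doc_topics` are made up for illustration, and in practice the per-document topic proportions would come from the trained model.

```python
from collections import Counter


def documents_per_topic(doc_topics):
    """Map each dominant topic index to (document count, fraction of all documents)."""
    # Dominant topic = index of the largest proportion in each document's row.
    dominant = [max(range(len(p)), key=p.__getitem__) for p in doc_topics]
    counts = Counter(dominant)
    total = len(doc_topics)
    return {topic: (n, n / total) for topic, n in sorted(counts.items())}


# Hypothetical document-topic proportions (one row per document, one column per topic).
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

summary = documents_per_topic(doc_topics)
for topic, (count, fraction) in summary.items():
    print(f"topic {topic}: {count} documents ({fraction:.0%})")
```

Hard assignment discards the mixture information; summing the proportions per topic instead would give an expected document count per topic.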
0 votes, 1 answer

Topic or Tag suggestion algorithm

Here is the problem: when given a block of text, I want to suggest possible topics. For example, a news article about Kobe Bryant would have suggested tags like: ‘basketball’, ‘nba’, ‘sports’. I have a fairly large training dataset (350k+) that…
user3287712
0 votes, 1 answer

LDA generated topics

I am relatively new to gensim and LDA (I started about two weeks ago), and I am having trouble trusting these results. The following are the topics produced from 11 one-paragraph documents. topic #0 (0.500): 0.059*island + 0.059*world +…
cs123
0 votes, 1 answer

Latent Dirichlet Allocation using Gensim on more than one corpus

I have two questions related to the usage of gensim for LDA. 1) How can I create a model using one corpus, save it, and perhaps extend it later by training it on another corpus? Is that possible? 2) Can LDA be used to classify an unseen…
Utsav T
0 votes, 0 answers

which.max(sapply, train_gibbs, logLik) error

I am following Grün and Hornik's (http://www.jstatsoft.org/v40/i13/) method of 10-fold cross-validation, calculating perplexity from 10-fold training and test sets. But I get an error when I create test_gibbs, which is shown at the end of the code…
user37874
0 votes, 1 answer

document-topic probability after training topic models using "topicmodels" in R: gamma or posterior()?

Below is what I get after training on 3328 text files using Gibbs sampling. I need to save a file that contains the document-topic probabilities. Is gamma the document-topic probability? But most of the numbers are smoothed and not very informative in terms…
user37874
0 votes, 2 answers

How can Topic Modeling noise be removed?

I am working on topic modeling where the given text corpus has a lot of noise in the form of supporting words that remain after removal of stop words. These words have a high term frequency but do not help in forming topic terms when using LDA along with other words…
0 votes, 1 answer

Topic modeling a corpus with one "majority topic" and several "minority topics"

I have a collection of documents, and most of them are about the same topic, and the rest are basically random topics. I wish to classify the documents into whether they are about the "majority topic" or are one of these random "minority topics".…
0 votes, 1 answer

Using topic modeling Java toolkit

I'm working on text classification and I want to use topic models (LDA). My corpus consists of at least 24,000 Persian news documents. Each document in the corpus is in the format of (keyword, weight) pairs extracted from the news. I saw two Java toolkits:…
S_M
0 votes, 1 answer

R Topic Modeling avoiding create_matrix

Normally when topic modeling I use something along the lines of: matrix <- create_matrix(cbind(as.vector(lda_data)), language="english", removeNumbers=TRUE, weighting=weightTf) k <- 20 #Hardcoded temp value lda <- LDA(matrix, k, method = "Gibbs",…
TheoretiCAL
0 votes, 2 answers

MALLET Java API Importing Data

I am trying to do Topic Modeling with the Java API. There is a handy example provided with the package. However, given the much larger size of my data, I think it would be impractical to import it all from one file. I looked at the powerpoint…
pjshap
0 votes, 1 answer

Keep digits in Mallet topic modeling

I am using Mallet for topic modeling. A large number of words in my input text include both letters and digits, e.g., A54, D892. I just noticed that Mallet automatically removes the digits and only keeps the letters in the words. I do not even use…
SM.
0 votes, 0 answers

How to import and use feature vectors in MALLET's topic modelling?

I am using MALLET's topic modelling. I have a set of keywords with weights for a set of documents, on which I want to train a model and then use it to run inference on new documents. Note: each keyword of a document has a weight assigned to it, which is similar to…
sravan_kumar
0 votes, 1 answer

Implementation advice on semi-supervised automated tagging

I'm wondering what approaches exist to develop an automated tagging system. I'm building a company-internal feedback platform and our business users wish to add tags to the posts. I'd like to build a system that suggests tags to users as they post,…
0 votes, 1 answer

Mallet Dirichlet parameter higher than 1

I've been using MALLET to perform my topic modeling (LDA). I tried to discover 20 topics in a dataset. The outcome is the following (the list of keywords is not important for this question): 0 0.05013 list_of_topic_keywords_0 1 0.06444…