Questions tagged [lda]

Latent Dirichlet Allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that account for why some parts of the data are similar.

If the observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics. LDA represents documents as mixtures of topics that emit words with certain probabilities.

It should not be confused with Linear Discriminant Analysis, a supervised learning procedure for classifying observations into a set of categories.
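For a concrete sense of this document-topic-word structure, here is a minimal sketch using Gensim in Python; the toy documents, topic count, and other parameter values are illustrative assumptions only.

    from gensim import corpora
    from gensim.models import LdaModel

    # Toy corpus: each document is a list of tokens.
    texts = [
        ["car", "engine", "wheel", "road"],
        ["watch", "strap", "water", "resistance"],
        ["car", "road", "trip", "wheel"],
    ]

    # Map tokens to integer ids and build bag-of-words vectors.
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]

    # Fit a small LDA model; num_topics = 2 is an arbitrary illustrative choice.
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   passes=10, random_state=1)

    # Each document is represented as a mixture of topics ...
    for bow in corpus:
        print(lda.get_document_topics(bow))

    # ... and each topic is a distribution over words.
    print(lda.print_topics(num_words=4))

Several of the sketches below reuse this kind of setup (a trained model `lda`, a `dictionary`, and a bag-of-words `corpus`).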

1175 questions
0
votes
0 answers

Text mining in R - input is an Excel file with each row being one document

I am new to R. I have a CSV file that includes 15,000 rows of text, where each row belongs to one person. I want to do Latent Dirichlet Allocation on it, but first I need to create a term-document matrix. However, I don't know how to make R treat each…
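The question asks about R, but the same preprocessing step can be sketched with Gensim in Python; the file name, column layout, and whitespace tokenisation below are assumptions.

    import csv
    from gensim import corpora

    # Assume one document per row, with the text in the first column (hypothetical file).
    with open("responses.csv", newline="", encoding="utf-8") as f:
        docs = [row[0] for row in csv.reader(f)]

    # Tokenise each row so that every row is treated as one document.
    texts = [doc.lower().split() for doc in docs]

    # The id-to-token dictionary plus the bag-of-words vectors play the role
    # of a term-document matrix for LDA.
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]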
0
votes
1 answer

LDA with topicmodels package for R, how do I get the topic probability for each term?

I'm using the topicmodels package for LDA. I would like to create a visualization that shows how related or unrelated the topics are. I envision a cluster of words that are unique to topic 1, but with a few keywords that are shared connecting to…
lmcshane
  • 1,074
  • 4
  • 14
  • 27
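The question targets the R topicmodels package; as a rough Python analog with Gensim (assuming a trained model `lda` and `dictionary` as in the sketch near the top of the page), the per-term probabilities of each topic come from the topic-word matrix.

    # Rows are topics, columns are vocabulary terms; each row sums to 1.
    topic_word = lda.get_topics()

    # Probability of every term under topic 0, labelled with the actual words.
    for word_id, prob in enumerate(topic_word[0]):
        print(dictionary[word_id], round(float(prob), 4))

    # Or just the top terms of a topic together with their probabilities.
    print(lda.show_topic(0, topn=5))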
0
votes
1 answer

R in Windows cannot handle some characters

I performed LDA in Linux and didn't get characters like "ø" in topic 2. However, when I run it in Windows, they show up. Does anyone know how to deal with this? I used the packages quanteda and topicmodels. > terms(LDAModel1,5) Topic 1 Topic 2 [1,] "car" …
user1569341
  • 333
  • 1
  • 6
  • 17
0
votes
1 answer

Collapsed Gibbs sampling in the R package lda

I've been trying to modify parts of the R package lda, specifically the slda.em function. At some point, the C function "collapsedGibbsSampler" gets called in slda.collapsed.gibbs.sampler. Does anyone have the C code for that function? I've looked…
user2592729
  • 429
  • 5
  • 16
0
votes
1 answer

How do I identify which features are being selected with LDA?

I have run LDA in MATLAB using the fitcdiscr and predict functions. I have a feeling there may be some bugs in my code, however, and as a sanity check I would like to identify which features are most heavily weighted in the classification. Can…
JP1
  • 731
  • 1
  • 10
  • 27
0
votes
0 answers

How does Spark LDA handle non-integer token counts (e.g. TF-IDF)?

I have been running a series of topic modeling experiments in Spark, varying the number of topics. So, given an RDD docsWithFeatures, I'm doing something like this: for (n_topics <- Range(65,301,5) ){ val s = n_topics.toString val lda = new…
moustachio
  • 2,924
  • 3
  • 36
  • 68
0
votes
2 answers

LDA in Spark 1.3.1: converting raw data into a term-document matrix?

I'm trying out LDA with Spark 1.3.1 in Java and got this error: Error: application failed with exception org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in…
user1569341
  • 333
  • 1
  • 6
  • 17
0
votes
1 answer

Topic proportions in my corpus?

Thanks for reading and taking the time to think about and respond to this. I am using Gensim's wrapper for Mallet (ldamallet.py), and it works like a charm. I need to get the topic proportions for my corpus (over all my documents) and I do not know…
JRun
  • 669
  • 1
  • 10
  • 17
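One way to compute corpus-wide proportions, sketched with a plain Gensim LDA model rather than the Mallet wrapper; the averaging choice and the variable names (`lda` and `corpus` from the sketch near the top of the page) are assumptions.

    import numpy as np

    totals = np.zeros(lda.num_topics)

    # Accumulate each document's topic distribution ...
    for bow in corpus:
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            totals[topic_id] += prob

    # ... then normalise to get corpus-wide topic proportions.
    proportions = totals / totals.sum()
    for topic_id, share in enumerate(proportions):
        print(f"topic {topic_id}: {share:.3f}")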
0
votes
1 answer

The accuracy of LDA prediction for new documents with Spark

I'm working with Spark MLlib and am now doing something with LDA. But when I use the code provided by Spark (see below) to predict a document that was used in training the model, the predicted document-topic result is completely different from the result of…
Carlos
  • 1
  • 2
0
votes
1 answer

How to infer the topic distribution of a new document with LDA/pLSA?

I have a question when using topic models like pLSA/LDA: how do I infer the topic distribution of a new document after we have the distribution of each word in each topic? I have tried "fold-in" Gibbs sampling with LDA, but when the unseen…
starays
  • 1
  • 2
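For comparison, Gensim's LDA implementation folds an unseen document into a trained model directly; the tokens below are an illustrative assumption, with `lda` and `dictionary` as in the sketch near the top of the page.

    # Tokens of a document that was not in the training corpus (illustrative).
    new_tokens = ["car", "engine", "road", "trip"]

    # Map the unseen document onto the existing vocabulary; out-of-vocabulary
    # words are simply dropped by doc2bow.
    new_bow = dictionary.doc2bow(new_tokens)

    # Inferred topic distribution for the new document.
    print(lda.get_document_topics(new_bow, minimum_probability=0.0))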
0
votes
1 answer

Predicting topics with LDA

I am trying to extract topic assignments from a fit I built with R's 'lda' package. I created a fit: fit <- lda.collapsed.gibbs.sampler(documents = documents, K = K, vocab = vocab, num.iterations = G, alpha = alpha, eta = eta, initial = NULL, …
Sylvia
  • 315
  • 2
  • 17
0
votes
2 answers

Topic model as a dimension reduction method for text mining -- what to do next?

My understanding of the workflow is: run LDA -> extract keywords (e.g. the top few words for each topic), and hence reduce dimension -> some subsequent analysis. My question is, if my overall purpose is to assign topics to articles in an…
nobody
  • 815
  • 1
  • 9
  • 24
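A common way to use the fitted model as a dimension-reduction step is to keep each document's topic proportions (or just its dominant topic) as the reduced features; a Gensim-based sketch under the same assumptions as the earlier snippets.

    import numpy as np

    def doc_topic_vector(bow, model):
        # Dense topic-proportion vector for one document: the reduced representation.
        vec = np.zeros(model.num_topics)
        for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
            vec[topic_id] = prob
        return vec

    # Documents-by-topics matrix: low-dimensional features for subsequent analysis.
    doc_topic = np.array([doc_topic_vector(bow, lda) for bow in corpus])

    # Or simply label every article with its single most probable topic.
    dominant_topic = doc_topic.argmax(axis=1)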
0
votes
1 answer

How to remove numbers and symbols from the output of LDA when using the Gensim package?

How do I remove these numbers from the output of LDA when using the Gensim package? 2015-08-25 15:26:20,439 : INFO : topic #8 (0.100): 0.038*watch + 0.020*water + 0.014*strap + 0.011*analog + 0.011*resistance + 0.010*atm + 0.010*coloured + 0.010*timepiece +…
Thomas N T
  • 459
  • 1
  • 3
  • 14
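The numbers in that log line are the per-word probabilities printed by Gensim; one way to display only the words is to read the (word, probability) pairs and drop the weights (assuming a trained model `lda` as in the sketch near the top of the page).

    # show_topic returns (word, probability) pairs, so the weights can simply be dropped.
    for topic_id in range(lda.num_topics):
        words = [word for word, prob in lda.show_topic(topic_id, topn=10)]
        print(f"topic #{topic_id}: {' '.join(words)}")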
0
votes
1 answer

Generating documents from LDA topic model

I'm learning a topic model from a set of documents and that's working well. But I'm wondering if any existing system will actually generate new documents from the topics and words in the model. I.e. say I want a new document of topic 0, will any of…
ten
  • 115
  • 1
  • 8
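Whether any existing system does this out of the box is the question; as a sketch, one can sample words from a single topic's word distribution with a Gensim model and NumPy (the function name and parameters below are hypothetical, with `lda` and `dictionary` as assumed earlier).

    import numpy as np

    def sample_document(model, dictionary, topic_id, length=20, seed=0):
        # Draw `length` words from a single topic's word distribution (a pure topic, not a mixture).
        rng = np.random.default_rng(seed)
        word_probs = model.get_topics()[topic_id]      # probability of each vocabulary word
        word_probs = word_probs / word_probs.sum()     # re-normalise against floating-point drift
        word_ids = rng.choice(len(word_probs), size=length, p=word_probs)
        return " ".join(dictionary[int(i)] for i in word_ids)

    print(sample_document(lda, dictionary, topic_id=0))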
0
votes
0 answers

DocumentTermMatrix() returns 0 terms in the tm package

I have an object like this: str(apps) chr [1:17517] "35 44 33 40 33 40 44 38 33 37 37" ... In each row, the numbers are separated by spaces. corpus <- Corpus(VectorSource(apps)) dtm <- DocumentTermMatrix(corpus) str(dtm) List of 6 $ i : int(0) $…
ysfseu
  • 666
  • 1
  • 10
  • 20