Questions tagged [lda]

Latent Dirichlet Allocation (LDA) is a generative statistical model in which sets of observations are explained by unobserved groups, which account for why some parts of the data are similar.

If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word in the document is generated by one of that document's topics. In other words, a document is represented as a mixture of topics, and each topic is a probability distribution over words.
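The generative story above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the vocabulary, the two topic-word distributions, and the Dirichlet parameter `alpha` are made-up toy values, and the function names are hypothetical. In full LDA the topic-word distributions are themselves drawn from a Dirichlet prior; they are fixed here for brevity.

```python
import random

# Toy vocabulary and two hand-written topics (assumed values, not learned).
VOCAB = ["delivery", "traffic", "price", "refund"]
TOPIC_WORD = [
    [0.6, 0.3, 0.05, 0.05],  # topic 0: mostly logistics words
    [0.05, 0.05, 0.5, 0.4],  # topic 1: mostly payment words
]

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, vocab, alpha, n_words, rng):
    """Generate one document: topic mixture first, then one topic per word."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]  # topic for this word
        words.append(rng.choices(vocab, weights=topic_word[z])[0])
    return theta, words

if __name__ == "__main__":
    theta, words = generate_document(
        TOPIC_WORD, VOCAB, alpha=[0.5, 0.5], n_words=10, rng=random.Random(0)
    )
    print(theta)
    print(words)
```

Fitting an LDA model runs this process in reverse: given only the words, it infers the topic mixtures and the topic-word distributions.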

It should not be confused with Linear Discriminant Analysis, a supervised learning procedure for classifying observations into a set of categories.

1175 questions
7
votes
1 answer

How do you initialize a gensim corpus variable with a csr_matrix?

I have X, a csr_matrix that I obtained using scikit-learn's TF-IDF vectorizer, and y, which is an array. My plan is to create features using LDA; however, I could not find how to initialize a gensim corpus variable with X as a csr_matrix. In other words,…
IssamLaradji
7
votes
2 answers

Implementing alternative forms of LDA

I am using Latent Dirichlet Allocation with a corpus of news data from six different sources. I am interested in topic evolution and emergence, and I want to compare how the sources are alike and different from each other over time. I know that there are…
user836015
6
votes
0 answers

Gensim LDA gives negative log-perplexity value - is it normal and how can I interpret it?

I am currently using Gensim LDA for topic modeling. While tuning hyper-parameters, I found that the model always gives a negative log-perplexity. Is it normal for the model to behave like this (is it even possible)? If it is, is smaller perplexity…
nowheretogo
6
votes
2 answers

How do I calculate the coherence score of an sklearn LDA model?

Here, best_lda_model is an sklearn-based LDA model and we are trying to find a coherence score for this model: coherence_model_lda = CoherenceModel(model=best_lda_model, texts=data_vectorized, dictionary=dictionary, coherence='c_v') coherence_lda =…
Arvind Sudheer
6
votes
1 answer

Should the "perplexity" (or "score") go up or down in the LDA implementation of Scikit-learn?

I'd like to know what the perplexity and score mean in the LDA implementation of scikit-learn. Those functions are obscure. At the very least, I need to know whether those values increase or decrease when the model is better. I've searched but it's…
Guillaume Chevalier
6
votes
1 answer

Automatic labeling of LDA generated topics

I'm trying to categorize customer feedback, and I ran an LDA in Python and got the following output for 10 topics: (0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" +…
Arman
6
votes
1 answer

LDA TopicModels producing list of numbers rather than terms

Bear with me, as I am extremely new to this and working on a project for a course in a certificate program. I have a .csv dataset that I obtained by retrieving bibliometric records from the PubMed and Embase databases. There are 1034 rows. There are…
SciLibby
6
votes
1 answer

Use topic modeling information from LDA as features to perform text classification through SVM

I want to perform text classification using topic modeling information as features that are fed to an SVM classifier. So I was wondering how it is possible to generate topic modeling features by performing LDA on both the training and test…
asterix
6
votes
1 answer

LDA interpretation

I use the HMeasure package to apply LDA in my analysis of credit risk. I have 11000 observations and I've chosen age and income to develop the analysis. I don't know exactly how to interpret the R results of LDA. So, I don't know if I have chosen the…
Dalila
6
votes
1 answer

Finding number of documents per topic for LDA with scikit-learn

I'm following along with the scikit-learn LDA example here and am trying to understand how I can (if possible) surface how many documents have been labeled as having each one of these topics. I've been poring through the docs for the LDA model here…
user139014
6
votes
1 answer

Understanding LDA Transformed Corpus in Gensim

I tried to compare the contents of the BOW corpus with the LDA[BOW Corpus] (transformed by an LDA model trained on that corpus with, say, 35 topics). I found the following output: DOC 1 : [(1522, 1), (2028, 1), (2082, 1), (6202, 1)] LDA 1 : [(29,…
Ravi Karan
6
votes
3 answers

Are there any efficient Python libraries for Dynamic Topic Models, preferably extending Gensim?

I'm trying to model Twitter stream data with topic models. Gensim, being an easy-to-use solution, is impressive in its simplicity. It has a truly online implementation for LSI, but not for LDA. For a changing content stream like Twitter, Dynamic…
Ravi Karan
6
votes
2 answers

Latent Dirichlet Allocation Solution Example

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory and, based on this blog post http://goo.gl/ccPvE, I was able to develop the intuition behind LDA. However I still haven't…
user737128
5
votes
4 answers

Convert one-document-per-line to Blei's lda-c/dtm format for topic modeling?

I am doing Latent Dirichlet Analyses for some research and keep running into a problem. Most LDA software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line represents the entirety of a document.…
user836015
5
votes
1 answer

Gensim module attribute error for wrappers

I want to find the optimal number of topics for LDA. To do this, I used Gensim as follows: def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3): coherence_values = [] model_list = [] for num_topics in…