Questions tagged [lda]

Latent Dirichlet Allocation (LDA) is a generative model in which sets of observations are explained by unobserved groups, and those groups account for why some parts of the data are similar.

If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word is attributable to one of the document's topics. Documents are thus represented as mixtures of topics, and each topic emits words with certain probabilities.
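
The generative story described above can be illustrated with a minimal sketch. This is not taken from any particular library: the function name `sample_lda_corpus`, the symmetric Dirichlet priors `alpha` and `beta`, and all default sizes are assumptions chosen for illustration, and words are represented by integer ids rather than strings.

```python
import numpy as np


def sample_lda_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=10,
                      alpha=0.5, beta=0.5, seed=0):
    """Draw a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)

    # One word distribution per topic: shape (n_topics, vocab_size).
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    # One topic mixture per document: shape (n_docs, n_topics).
    doc_topic = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)

    docs = []
    for theta in doc_topic:
        # Pick a topic for every word position from the document's mixture.
        topics = rng.choice(n_topics, size=doc_len, p=theta)
        # Emit each word from the distribution of the topic assigned to it.
        words = [int(rng.choice(vocab_size, p=topic_word[k])) for k in topics]
        docs.append(words)

    return topic_word, doc_topic, docs


if __name__ == "__main__":
    topic_word, doc_topic, docs = sample_lda_corpus()
    print(doc_topic.round(2))  # per-document topic mixtures
    print(docs[0])             # word ids of the first sampled document
```

Fitting an LDA model is the reverse problem: given only the documents, estimate the topic-word and document-topic distributions that plausibly generated them.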

Latent Dirichlet Allocation should not be confused with Linear Discriminant Analysis, which shares the abbreviation but is a supervised learning procedure for classifying observations into a set of categories.

1175 questions
16
votes
1 answer

Getting topic-word distribution from LDA in scikit learn

I was wondering if there is a method in the LDA implementation of scikit-learn that returns the topic-word distribution, like the gensim show_topics() method. I checked the documentation but didn't find anything.
Niro
  • 433
  • 3
  • 21
16
votes
1 answer

How to monitor convergence of Gensim LDA model?

I can't seem to find it, or probably my knowledge of statistics and its terms is the problem here, but I want to achieve something similar to the graph found at the bottom of the page of the LDA lib from PyPI and observe the uniformity/convergence of the…
ZeferiniX
  • 500
  • 5
  • 18
16
votes
3 answers

Extract document-topic matrix from Pyspark LDA Model

I have successfully trained an LDA model in spark, via the Python API: from pyspark.mllib.clustering import LDA model=LDA.train(corpus,k=10) This works completely fine, but I now need the document-topic matrix for the LDA model, but as far as I can…
moustachio
  • 2,924
  • 3
  • 36
  • 68
15
votes
2 answers

How to get a complete topic distribution for a document using gensim LDA?

When I train my lda model as such dictionary = corpora.Dictionary(data) corpus = [dictionary.doc2bow(doc) for doc in data] num_cores = multiprocessing.cpu_count() num_topics = 50 lda = LdaMulticore(corpus, num_topics=num_topics, id2word=dictionary,…
PyRsquared
  • 6,970
  • 11
  • 50
  • 86
15
votes
1 answer

Spark LDA consumes too much memory

I'm trying to use Spark MLlib LDA to summarize my document corpus. My problem setting is as below: about 100,000 documents, about 400,000 unique words, 100 clusters. I have 16 servers (each has 20 cores and 128GB memory). When I execute LDA with…
Du Shiqiao
  • 377
  • 1
  • 9
15
votes
1 answer

How to interpret LDA components (using sklearn)?

I used Latent Dirichlet Allocation (sklearn implementation) to analyse about 500 scientific article abstracts and I got topics containing the most important words (in German). My problem is to interpret these values associated with the most…
LSz
  • 161
  • 1
  • 6
14
votes
3 answers

Evaluation of topic modeling: How to understand a coherence value / c_v of 0.4, is it good or bad?

I need to know whether a coherence score of 0.4 is good or bad. I use LDA as the topic modelling algorithm. What is the average coherence score in this context?
User Mohamed
  • 169
  • 1
  • 1
  • 4
14
votes
1 answer

Understanding parameters in Gensim LDA Model

I am using gensim.models.ldamodel.LdaModel to perform LDA, but I do not understand some of the parameters and cannot find explanations in the documentation. If someone has experience working with this, I would love further details of what these…
Jane Sully
  • 3,137
  • 10
  • 48
  • 87
14
votes
1 answer

Latent Dirichlet allocation (LDA) in Spark

I am trying to write a program in Spark for carrying out Latent Dirichlet allocation (LDA). This Spark documentation page provides a nice example for performing LDA on the sample data. Below is the program: from pyspark.mllib.clustering import LDA,…
prashanth
  • 4,197
  • 4
  • 25
  • 42
14
votes
1 answer

Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

I am interested in applying LDA topic modelling using Spark MLlib. I have checked the code and the explanations here, but I couldn't find how to then use the model to find the topic distribution of a new unseen document.
Rami
  • 8,044
  • 18
  • 66
  • 108
14
votes
7 answers

Hierarchical Dirichlet Process Gensim topic number independent of corpus size

I am using the Gensim HDP module on a set of documents. >>> hdp = models.HdpModel(corpusB, id2word=dictionaryB) >>> topics = hdp.print_topics(topics=-1, topn=20) >>> len(topics) 150 >>> hdp = models.HdpModel(corpusA, id2word=dictionaryA) >>> topics…
Sam Weisenthal
  • 2,791
  • 9
  • 28
  • 66
14
votes
3 answers

Supervised Latent Dirichlet Allocation for Document Classification?

I have a bunch of documents that have already been classified by humans into groups. Is there a modified version of LDA which I can use to train a model and then later classify unknown documents with it?
snøreven
  • 1,904
  • 2
  • 19
  • 39
13
votes
2 answers

pyLDAvis visualization of pyspark generated LDA model

Does anyone have an example of data visualization of an LDA model trained using the PySpark library (specifically using pyLDAvis)? I've seen a lot of examples for GenSim and other libraries but not PySpark. Specifically I'm wondering what to pass…
igodfried
  • 877
  • 9
  • 22
12
votes
2 answers

Gensim LDA topic assignment

I am hoping to assign each document to one topic using LDA. Now I realise that what you get is a distribution over topics from LDA. However as you see from the last line below I assign it to the most probable topic. My question is this. I have to…
sachinruk
  • 9,571
  • 12
  • 55
  • 86
11
votes
3 answers

ImportError: No module named 'sklearn.lda'

When I run classifier.py in the openface demos directory using: classifier.py train ./generated-embeddings/ I get the following error message: --> from sklearn.lda import LDA ModuleNotFoundError: No module named 'sklearn.lda'. I think to have…
mauroV8F5
  • 137
  • 1
  • 1
  • 6