Questions tagged [lda]

Latent Dirichlet Allocation (LDA) is a generative model in which sets of observations are explained by unobserved groups; these groups account for why some parts of the data are similar.

If the observations are words collected into documents, LDA posits that each document is a mixture of a small number of topics and that each word is attributable to one of the document's topics. In other words, LDA represents documents as mixtures of topics, and each topic emits words with certain probabilities.
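The generative story described above can be sketched in a few lines of standard-library Python. This is only a toy illustration, not how any particular library implements LDA: the vocabulary, the `topic_word` table, the `alpha` prior, and the helper names `sample_dirichlet` and `generate_document` are all hypothetical and chosen for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, alpha, vocab, n_words, rng):
    """Generate one document by following LDA's generative story."""
    # Per-document topic mixture, drawn from a Dirichlet prior.
    theta = sample_dirichlet(alpha, rng)
    assignments, words = [], []
    for _ in range(n_words):
        # Choose the topic responsible for this word position...
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # ...then choose a word from that topic's word distribution.
        w = rng.choices(vocab, weights=topic_word[z])[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Hypothetical toy setup: 2 topics over a 4-word vocabulary.
vocab = ["gene", "dna", "ball", "goal"]
topic_word = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only emits "gene" and "dna"
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only emits "ball" and "goal"
]
alpha = [0.5, 0.5]  # assumed symmetric prior over the two topics

rng = random.Random(0)
theta, assignments, words = generate_document(topic_word, alpha, vocab, 8, rng)
print("topic mixture:", theta)
print("words:", words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the per-document mixtures and the per-topic word distributions.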

It should not be confused with Linear Discriminant Analysis, a supervised learning procedure for classifying observations into a set of categories.

1175 questions
9 votes, 1 answer

Can we use a self-made corpus for training LDA using gensim?

I have to apply LDA (Latent Dirichlet Allocation) to get the possible topics from a database of 20,000 documents that I collected. How can I use these documents rather than other available corpora like the Brown Corpus or English Wikipedia as…
Animesh Pandey
8 votes, 2 answers

Gensim LDA Coherence Score Nan

I created a Gensim LDA Model as shown in this tutorial: https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ lda_model = gensim.models.LdaMulticore(data_df['bow_corpus'], num_topics=10, id2word=dictionary, random_state=100,…
Ramsha Siddiqui
8 votes, 1 answer

Latent Dirichlet allocation (LDA) in Spark - replicate model

I want to save the LDA model from the pyspark ml-clustering package and apply the saved model to the training and test datasets. However, the results diverge despite setting a seed. My code is the following: 1) Import packages from…
raffaelo92
8 votes, 2 answers

python scikit learn, get documents per topic in LDA

I am running LDA on text data, using the example here: My question is: How can I know which documents correspond to which topic? In other words, which documents are about topic 1, for example? Here are my steps: n_features =…
passion
8 votes, 1 answer

Is there any way to match Gensim LDA output with topics in pyLDAvis graph?

I need to process the topics in the LDA output (lda.show_topics(num_topics=-1, num_words=100...) and then compare my results with the pyLDAvis graph, but the topics are numbered differently. Is there a way I can match them?
m.khalil
8 votes, 1 answer

Google Cloud Dataproc configuration issues

I've been encountering various issues in some Spark LDA topic modeling I've been running (mainly disassociation errors at seemingly random intervals), which I think mostly have to do with insufficient memory allocation on my executors. This would…
8 votes, 1 answer

Gensim get topic for a document (seen document)

I know that after training the lda model for gensim, we can get the topic for an unseen document by: lda = LdaModel(corpus, num_topics=10) doc_lda = lda[doc_bow] But how about the documents that are already used for training? I mean is there a way…
CentAu
8 votes, 3 answers

How to print out the full distribution of words in an LDA topic in gensim?

The lda.show_topics method in the following code only prints the distribution of the top 10 words for each topic. How do I print out the full distribution of all the words in the corpus? from gensim import corpora, models documents = ["Human…
alvas
7 votes, 3 answers

WordCloud Only Supported for TrueType fonts

I am trying to generate a word cloud using the WordCloud module in Python, however I see the following error whenever I call .generate Traceback (most recent call last): File "/mnt/6db3226b-5f96-4257-980d-bb8ec1dad8e7/test.py", line 4, in…
Matthew
7 votes, 2 answers

pyLDAvis visualization from gensim not displaying the result in google colab

import pyLDAvis.gensim # Visualize the topics pyLDAvis.enable_notebook() vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word) vis The above code displayed the visualization of LDA model in google colab but then after reopening the notebook it…
Ravi Prajapati
7 votes, 3 answers

Meaning of bar width for pyLDAvis for lambda = 0

Not sure if this is the right forum but I was wondering if anyone understands how to interpret the width of the red vs. blue bars on the right-hand side of pyLDAvis plots when lambda = 0 (see…
user3490622
7 votes, 3 answers

How to get the topic associated with each document using pyspark (2.1.0) LDA?

I am using LDAModel of pyspark to get topics from corpus. My goal is to find topics associated with each document. For that purpose I tried to set topicDistributionCol as per Docs. Since I am new to this, I am not sure what is the purpose of this…
7 votes, 1 answer

Using LDA (topic model): the distributions of each topic over words are similar and "flat"

Latent Dirichlet Allocation (LDA) is a topic model for finding the latent variables (topics) underlying a collection of documents. I'm using the Python gensim package and have two problems: I printed out the most frequent words for each topic (I tried 10,20,50…
Ruby
7 votes, 1 answer

How can I speed up a topic model in R?

Background: I am trying to fit a topic model with the following data and specification: documents = 140 000, words = 3000, and topics = 15. I am using the package topicmodels in R (3.1.2) on a Windows 7 machine (24 GB RAM, 8 cores). My problem is that…
7 votes, 2 answers

Bug in scikit-learn's LDA function - plots show non-zero correlation

I did some LDA using scikit-learn's LDA function and I noticed in my resulting plots that there is a non-zero correlation between LDs. from sklearn.lda import LDA sklearn_lda = LDA(n_components=2) transf_lda = sklearn_lda.fit_transform(X, y) This…
user2489252