Questions tagged [topic-modeling]

Topic models describe the frequency of topics in documents and text. A "topic" is a group of words which tend to occur together.

A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, and "cat" and "meow" will appear more often in documents about cats (source: Wikipedia).

Generative models (i.e., the statistical models used for topic modeling):

  • Latent Dirichlet Allocation (LDA)
  • Hierarchical Dirichlet process (HDP)
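
The generative story behind LDA can be sketched in a few lines. This is a minimal illustration, not a training algorithm: it assumes two hand-written toy topics (the word probabilities and the names `TOPICS`, `sample_dirichlet`, and `generate_document` are hypothetical), draws per-document topic proportions from a Dirichlet distribution, and then picks a topic and a word for each position. Full LDA also treats the topic-word distributions as unknown and infers everything from data; here they are fixed for clarity.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative story.

    topics: list of dicts mapping word -> probability (one dict per topic)
    alpha:  Dirichlet concentration parameters, one per topic
    """
    # Per-document topic proportions.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # Choose a topic for this word position according to theta.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Choose a word according to that topic's word distribution.
        vocab = list(topics[z].keys())
        probs = list(topics[z].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Toy topics (hypothetical values chosen for illustration only).
TOPICS = [
    {"dog": 0.5, "bone": 0.4, "meow": 0.1},  # a "dogs" topic
    {"cat": 0.5, "meow": 0.4, "bone": 0.1},  # a "cats" topic
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
print("topic proportions:", theta)
print("document:", " ".join(doc))
```

Fitting a topic model runs this process in reverse: given only the documents, it estimates the topic proportions and the word distributions that most plausibly produced them.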

Software / Libraries

Related Tags:

980 questions
3
votes
0 answers

Mallet LDA ArrayIndexOutOfBoundsException while training the model

I am trying to build a model with 500 or 1000 topics on a 1M document dataset with Mallet LDA. After 60 iterations I am getting an ArrayIndexOutOfBoundsException. The error message is as below: <60> LL/token: -7.64386 overflow on type…
ak.
  • 143
  • 9
3
votes
1 answer

After implementing topic modelling of a text file I am getting similar words to describe all the topics and the results are inaccurate.

from nltk.tokenize import RegexpTokenizer from stop_words import get_stop_words from gensim import corpora, models import gensim import os from os import path from time import sleep tokenizer = RegexpTokenizer(r'\w+') en_stop =…
Raj
  • 171
  • 1
  • 1
  • 8
3
votes
2 answers

How much time for a topic modeling via MALLET on 9GB corpus

I would like to do LDA topic modeling on a 9GB corpus. The plan is to train LDA model using MALLET for 1000 iterations with 100 topics, optimizing hyperparameters every 10 iterations after a 200 iteration burn-in period. I am working on 64-bit Win8,…
GileBrt
  • 1,830
  • 3
  • 20
  • 28
3
votes
2 answers

R: How to generate vectors of highest value in each row?

Let's say that my data frame contains

    > DF
       V1  V2  V3
    1 0.3 0.4 0.7
    2 0.4 0.2 0.1
    3 0.2 0.8 0.3
    4 0.5 0.8 0.9
    5 0.2 0.7 0.8
    6 0.8 0.3 0.6
    7 0.1 0.5 0.4

the rows would be the different types of…
Elizabeth
  • 71
  • 2
  • 9
3
votes
0 answers

Perplexity in topic modeling

I have run the LDA using topic models package on my training data. How can I determine the perplexity of the fitted model? I read the instruction, but I am not sure which code I should use. Here's what I have so far: burnin <- 500 iter <- 1000 #keep…
3
votes
1 answer

ImportError: cannot import name BytesIO on eclipse

I am getting the following error and am just not able to figure out why gensim can't be imported. I tried reimporting gensim by creating a virtual environment, but that didn't work either. I am new to Python, please be generous. Traceback (most…
Tejasvi Rao
  • 53
  • 1
  • 6
3
votes
1 answer

Memory efficient LDA training using gensim library

Today I just started writing a script that trains LDA models on large corpora (minimum 30M sentences) using the gensim library. Here is the current code that I am using: from gensim import corpora, models, similarities, matutils def…
amin
  • 445
  • 1
  • 4
  • 14
3
votes
0 answers

Spark LDA woes - prediction and OOM questions

I'm evaluating Spark 1.6.0 to build and predict against large (millions of docs, millions of features, thousands of topics) LDA models, something I can accomplish pretty easily with Yahoo! LDA. Starting small, following the Java examples, I built…
3
votes
1 answer

Getting term weights out of an LDA model in R

I was wondering if anyone knows of a way to extract term weights / probabilities out of a topic model constructed in R, using the topicmodels package. Following the example in the following link I created a topic model like so: Gibbs = LDA(JSS_dtm,…
IVR
  • 1,718
  • 2
  • 23
  • 41
3
votes
1 answer

Plot the evolution of an LDA topic across time

I'd like to plot how the proportion of a particular topic changes over time, but I've been having some trouble isolating a single topic and plotting over time, especially for plotting multiple groups of documents separately (let's create two groups…
mlinegar
  • 1,389
  • 1
  • 11
  • 19
3
votes
1 answer

Mallet - Topic Modeling - Stopwords Error

Although I add an extra stopword list along with the default stopword list when I use MALLET for topic modeling, some stop words still appear in the topic models, for example "ın", "ıf", "ıt". How do I ensure that these stopwords don't appear in topic models? Topic models…
bubunny
  • 39
  • 5
3
votes
1 answer

Training a LDA model with gensim from some external tf-idf matrix and term list

I have a tf-idf matrix already, with rows for terms and columns for documents. Now I want to train a LDA model with the given terms-documents matrix. The first step seems to be using gensim.matutils.Dense2Corpus to convert the matrix into the corpus…
Ziyuan
  • 4,215
  • 6
  • 48
  • 77
3
votes
1 answer

R LDA Topic Modeling: Result topics contains very similar words

All: I am a beginner in R topic modeling; it all started three weeks ago. My problem is that I can successfully process my data into a corpus, a document-term matrix, and the LDA function. I have tweets as my input, about 460,000 of them. But I am not happy…
3
votes
1 answer

MALLET Topic Modeling: input String

I have this code to import a file .mallet: File f=new File("/home/test/file.mallet"); InstanceList t=InstanceList.load(f); but if I wanted to switch manually every single instance, how could I do? I tried this: String str="Test for…
Enzo
  • 597
  • 1
  • 8
  • 22
3
votes
2 answers

Mallet in R regex error: java.lang.NoSuchMethodException: No suitable method for the given parameters

I've been following the tutorial on how to use mallet in R to create topic models. My text file has 1 sentence per line. It looks like this and has about 50 sentences. Thank you again and have a good day :). This is an apple. This is awesome! LOL! i…
jxn
  • 7,685
  • 28
  • 90
  • 172