Convert one-document-per-line to Blei's lda-c/dtm format for topic modeling?

Question

I am running Latent Dirichlet Allocation (LDA) analyses for some research and keep running into a problem. Most LDA software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line holds the entirety of one document. However, Blei's lda-c and dynamic topic model software require the data in this format:

[M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count]

Here [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer that indexes the term; it is not a string.
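To make the target format concrete, here is a minimal Python sketch of the conversion using only the standard library. The function name docs_to_ldac, the lowercasing, and the whitespace tokenization are illustrative assumptions, not part of lda-c; real text would need proper tokenization and probably stop-word removal.

```python
from collections import Counter


def docs_to_ldac(docs):
    """Convert an iterable of documents (one string each) to lda-c lines.

    Returns (lines, vocab), where vocab[i] is the term with integer id i.
    """
    term_ids = {}  # term -> integer id, assigned in order of first appearance
    lines = []
    for doc in docs:
        # Naive tokenization (an assumption): lowercase, split on whitespace.
        counts = Counter(doc.lower().split())
        for term in counts:
            term_ids.setdefault(term, len(term_ids))
        pairs = sorted((term_ids[term], n) for term, n in counts.items())
        # One line per document: [M] followed by M "term_id:count" pairs.
        fields = [str(len(pairs))] + [f"{tid}:{n}" for tid, n in pairs]
        lines.append(" ".join(fields))
    vocab = sorted(term_ids, key=term_ids.get)
    return lines, vocab


lines, vocab = docs_to_ldac(["a b a c", "b d"])
print("\n".join(lines))
# 3 0:2 1:1 2:1
# 2 1:1 3:1
```

The vocab list is worth writing to a separate file (one term per line) so that the integer ids can be mapped back to words when inspecting topics. An empty input line produces a document with zero terms, so such lines are probably best filtered out beforehand.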

Does anyone know of a utility that will let me quickly convert to this format? Thank you.

I ran into a similar problem. Did you happen to find a solution? Thanks. — user288609, Mar 09 '12 at 22:26
I have not implemented it yet, but [this Python utility](https://github.com/JoKnopp/text2ldac) was posted to the topic models mailing list and is supposed to take text files and convert them to the correct format. — , Mar 10 '12 at 15:47

Ben · Answer 1 · 2013-04-28T08:08:30.177

If you are working with R, the lda package contains a function lexicalize that will convert raw text into the lda-c format necessary for the lda package.

library(lda)

example <- c("I am the very model of a modern major general",
             "I have a major headache")

# lexicalize returns a list with $documents (index/count matrices) and $vocab
corpus <- lexicalize(example, lower=TRUE)

Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.

So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.

score 2 · Answer 2 · answered Feb 25 '13 at 08:52

The MALLET package from the University of Massachusetts Amherst is another option.

And here is an excellent step-by-step demo on how to use Mallet:

http://programminghistorian.org/lessons/topic-modeling-and-mallet

MALLET can take ordinary plain text files as its input source.

score 1 · Answer 3 · answered Jan 04 '13 at 15:29

Gensim offers an implementation of Blei's corpus format (the gensim.corpora.BleiCorpus class). You could build a quick corpus from your CSV file in Python and then save it in lda-c format with gensim. It should not be too hard.
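A rough sketch of that approach, assuming gensim is installed and the documents have already been read from the CSV into a list of strings. The helper name save_as_ldac and the whitespace tokenization are illustrative assumptions; Dictionary, doc2bow, and BleiCorpus.serialize are gensim's own.

```python
def save_as_ldac(docs, path):
    """Write documents (one string each) to `path` in lda-c format via gensim."""
    # Third-party dependency (pip install gensim), imported where it is used.
    from gensim import corpora

    # Naive tokenization (an assumption): lowercase, split on whitespace.
    texts = [doc.lower().split() for doc in docs]
    # Map every distinct term to an integer id.
    dictionary = corpora.Dictionary(texts)
    # Each document becomes a list of (term_id, count) pairs.
    bow = [dictionary.doc2bow(text) for text in texts]
    # Writes the lda-c file plus a companion vocabulary file at path + ".vocab".
    corpora.BleiCorpus.serialize(path, bow, id2word=dictionary)
    return dictionary
```

Calling save_as_ldac(docs, "corpus.lda-c") should leave one line per document in corpus.lda-c, with the matching vocabulary alongside it.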

Karsten

882
6
18

score 0 · Answer 4 · answered May 16 '18 at 04:13

For Python, there is a function for this (it may not have been available at the time of the question):

lda.utils.dtm2ldac

The documentation is at https://pythonhosted.org/lda/api.html#module-lda.utils

Lei Hao
