5

I am running Latent Dirichlet Allocation (LDA) analyses for some research and keep running into a problem. Most LDA software requires documents to be in doclines format, meaning a CSV or other delimited file in which each line holds the entire text of one document. However, Blei's lda-c and dynamic topic model software require the data to be in the format: [M] [term_1]:[count] [term_2]:[count] ... [term_N]:[count] where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document. Note that [term_1] is an integer that indexes the term; it is not a string.
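To make the target format concrete, here is a minimal sketch of the conversion in plain Python. It is only an illustration under stated assumptions: the function name `doclines_to_ldac` is hypothetical, tokens are produced by lowercasing and splitting on whitespace, and term indices are zero-based and assigned in order of first appearance. A real pipeline would probably want proper tokenization and stop-word handling.

```python
from collections import Counter


def doclines_to_ldac(docs):
    """Turn an iterable of document strings into lda-c style lines.

    Returns (lines, vocab), where vocab[i] is the term string for index i.
    Assumption: whitespace tokenization and zero-based term indices.
    """
    vocab = {}  # term string -> integer index, in order of first appearance
    lines = []
    for text in docs:
        counts = Counter()
        for token in text.lower().split():
            # Reuse the existing index for a known term, else assign the next one.
            index = vocab.setdefault(token, len(vocab))
            counts[index] += 1
        pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
        # [M] is the number of unique terms in this document.
        lines.append(f"{len(counts)} {pairs}".rstrip())
    return lines, list(vocab)


lines, vocab = doclines_to_ldac(["the cat sat on the mat", "the dog sat"])
print("\n".join(lines))
```

For these two toy documents, "the" receives index 0 and appears twice in the first document, so the first line begins with `5 0:2`.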

Does anyone know of a utility that will let me quickly convert to this format? Thank you.

Anders R. Bystrup
  • I have run into a similar problem. Did you ever find a solution? Thanks. – user288609 Mar 09 '12 at 22:26
  • I have not tried it yet, but [this Python utility](https://github.com/JoKnopp/text2ldac) was posted to the topic models mailing list and is supposed to take text files and convert them to the correct format. –  Mar 10 '12 at 15:47

4 Answers

3

If you are working in R, the lda package provides a function, lexicalize, that converts raw text into the lda-c-style format that the package expects.

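library(lda)  # provides lexicalize()

example <- c("I am the very model of a modern major general",
             "I have a major headache")

# Tokenize and index the documents; lower=TRUE lowercases the text first
corpus <- lexicalize(example, lower=TRUE)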

Similarly, the topicmodels package has a function dtm2ldaformat that will convert a document term matrix to the lda format. You can convert a plain text document into a document term matrix using the tm package, also in R.

So with these existing functions there's a lot of flexibility in getting text into R for topic modelling.

Ben
2

The Mallet package from the University of Massachusetts Amherst is another option.

And here is an excellent step-by-step demo on how to use Mallet:

Mallet accepts ordinary text files as its input source.

Mountain
1

Gensim offers an implementation of Blei's corpus format. See here. You could build a quick corpus from your CSV file in Python and then save it in lda-c format with Gensim. It should not be too hard.

Karsten
0

For Python, there is a function for this (it may not have been available when the question was asked):

lda.utils.dtm2ldac

The documentation is at https://pythonhosted.org/lda/api.html#module-lda.utils

Lei Hao