
I have a corpus of text in which each line of the CSV file uniquely specifies a "topic" I am interested in. If I were to run a topic model on this corpus using LDA or Gibbs sampling from either the topicmodels package or lda, then, as expected, I would get multiple topics per "document" (a line of text in my CSV, which I have a priori defined to be my unique topic of interest). I understand that this is a result of the topic model's algorithm and the bag-of-words assumption.

What I am curious about, however, is this:

Is there a pre-built package in R that is designed to let the user specify the topics via the empirical word distribution? That is, I don't want the topics to be estimated; I want to tell R what the topics are. I suppose I could run a topic model with the correct number of topics, take the structure of the resulting object, and then overwrite its contents. I was just hoping there was an easier or more obvious way that I'm not seeing at this point.
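To illustrate what I mean, here is a minimal base-R sketch (the term list, topic names, and counts are all made up for illustration): if I already know each topic's word distribution, I can score documents against those distributions directly, without estimating anything.

```r
terms <- c("price", "market", "stock", "rain", "sun", "cloud")

# User-specified topic-term distributions (each row sums to 1).
phi <- rbind(
  finance = c(0.4, 0.3, 0.3, 0.0, 0.0, 0.0),
  weather = c(0.0, 0.0, 0.0, 0.4, 0.3, 0.3)
)
colnames(phi) <- terms

# Toy document-term matrix (rows = documents, columns = terms).
dtm <- rbind(
  doc1 = c(3, 2, 1, 0, 0, 0),
  doc2 = c(0, 1, 0, 2, 3, 1)
)
colnames(dtm) <- terms

# Per-document log-likelihood under each topic; smooth zeros so log() is finite.
loglik <- dtm %*% t(log(phi + 1e-6))

# Normalize to per-document topic "probabilities" (softmax by row).
probs <- exp(loglik - apply(loglik, 1, max))
probs <- probs / rowSums(probs)

round(probs, 3)
```

This is obviously cruder than a proper model, but it captures the idea of "telling R what the topics are" instead of letting them be estimated.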

Thoughts?

Edit: I just thought about the alpha and beta parameters, which control the topic and term distributions within the LDA algorithm. What settings might force the model to find only one topic per document? Or is there a setting that would allow that to occur?
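To build some intuition about alpha, here is a quick base-R sketch: a draw from a symmetric Dirichlet(alpha) can be made by normalizing Gamma(alpha) draws, and as alpha shrinks the document-topic proportions become nearly one-hot, which is the "one topic per document" behavior I'm after. (I believe topicmodels accepts alpha via its control list, though I'm not certain a small alpha would force exactly one topic rather than merely encouraging sparsity.)

```r
# Toy illustration (base R only): a symmetric Dirichlet(alpha) draw is a
# normalized vector of Gamma(alpha) draws. Small alpha concentrates nearly
# all mass on one topic; large alpha spreads mass evenly across topics.
set.seed(42)
k <- 5  # number of topics

draw_dirichlet <- function(alpha, k) {
  g <- rgamma(k, shape = alpha) + 1e-300  # guard against underflow to zero
  g / sum(g)
}

sparse <- draw_dirichlet(0.1, k)  # skewed: roughly one dominant topic
flat   <- draw_dirichlet(50, k)   # even: topics share the mass

round(sparse, 3)
round(flat, 3)
```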

If these seem like silly questions, I understand; I'm quite new to this particular field and am finding it fascinating.

Thomas
william

1 Answer


What are you trying to accomplish with this approach? If you want to tell R what the topics are so it can predict the topics in other lines or documents, then RTextTools may be a helpful package.

Joshua Rosenberg
  • Great suggestion! Yes, I am attempting to supervise the topic generation and ultimately classify text based on my a priori specified topics. I'd like the model to detect the presence of any topic, though, and not just "sort" the documents into particular stacks. That is to say, I fully expect the documents I am attempting to classify to contain multiple topics. I'll have a look at RTextTools, which so far looks like it might help me. – william Jun 16 '15 at 16:30
  • Based on my reading thus far, it looks like these methods assign a document to a single classification (a stack, if you will), without regard to the idea that a document could belong to multiple stacks. I'd like to keep the assumption that a document can contain multiple topics, but I want to supervise the training of the topics; that is, I know what the topics are and need a way to convey that so the sorting algorithm can correctly tell me which topics document A contains. – william Jun 16 '15 at 17:26
  • Perhaps a reframing of my question would help: I have a massive collection of documents (40K+ or so) that have certain characteristics (topics) I am interested in. I have a single CSV file containing descriptions of my unique topics of interest, where each line is a unique topic with a description field holding the text we defined to indicate what the row represents (there are over 200 in total). I want an algorithm that would use the a priori defined topics and tell me that document 16,743 contains topics 1, 2, and 3 with probability 0.xx, 0.xx, and 0.xx respectively. – william Jun 16 '15 at 17:30
  • It looks like structural topic models will handle what I'm looking to do, although I am curious whether there is an sLDA-style function that allows for categorical responses rather than assumed continuous responses. – william Jun 16 '15 at 21:15