
Right now, I'm using the LDA topic modelling tool from the MALLET package to do some topic detection on my documents. Everything was fine initially: I got 20 topics from it. However, when I try to infer topics for a new document using the model, the results are kind of baffling.

For instance, I deliberately ran my model over a document that I created manually, containing nothing but keywords from one of the topics ("FLU"), but the topic distribution I got was < 0.1 for every topic. I then tried the same thing on one of the already-sampled documents, which had a high score of 0.7 for one of the topics. Again, the same thing happened.

Can someone give some clue on the reason?

I tried asking on the MALLET mailing list, but apparently no one has replied.

Adinia
goh
  • When you say that you run your model over the document you created, what exactly are you doing? Are you attempting to re-run the inference portion of the LDA algorithm on the new document? If so, your result would be expected behavior. It sounds like you are trying to train a new model based solely upon the new document. Could you reply with your actual command? LDA does not fold new documents into the topic distributions without inferring over all the original documents as well, since it is an algorithm over a collection of documents. – user1698895 Jul 06 '11 at 00:12

4 Answers


I also know very little about MALLET, but the docs mention this...

Topic Inference

--inferencer-filename [FILENAME] Create a topic inference tool based on the current, trained model. Use the MALLET command bin/mallet infer-topics --help to get information on using topic inference.

Note that you must make sure that the new data is compatible with your training data. Use the option --use-pipe-from [MALLET TRAINING FILE] in the MALLET command bin/mallet import-file or import-dir to specify a training file.

Maybe you forgot to do this? It does sound to me like the data you are training on is not in the same format as the data you are testing on.
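To see why incompatible training and test data would produce exactly this symptom, here is a minimal arithmetic sketch. This is plain Java, not MALLET code; the class name and the smoothed-estimate formula are illustrative assumptions, but the idea is standard: if a test document is imported with a different alphabet, its tokens are out-of-vocabulary and dropped, and the topic estimate collapses to the uniform prior.

```java
import java.util.Arrays;

public class EmptyDocPrior {
    // Hypothetical illustration of a smoothed per-document topic estimate:
    // (n_k + alpha) / (N + K * alpha), where n_k is the number of tokens
    // assigned to topic k, N the total surviving tokens, K the number of
    // topics and alpha the Dirichlet prior.
    static double[] topicDistribution(int[] topicCounts, double alpha) {
        int k = topicCounts.length;
        int n = Arrays.stream(topicCounts).sum();
        double[] dist = new double[k];
        for (int i = 0; i < k; i++) {
            dist[i] = (topicCounts[i] + alpha) / (n + k * alpha);
        }
        return dist;
    }

    public static void main(String[] args) {
        int k = 20;
        // If every token of the test document was dropped as
        // out-of-vocabulary, n_k = 0 for all k.
        int[] noSurvivingTokens = new int[k];
        double[] dist = topicDistribution(noSurvivingTokens, 0.01);
        for (double p : dist) {
            System.out.printf("%.3f%n", p); // 0.050 for every topic
        }
    }
}
```

With 20 topics and no surviving tokens, every topic gets exactly 1/20 = 0.05, which would show up as "< 0.1 for every topic", just as described in the question.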

Stompchicken
  • hey @StompChicken, I have tried the --use-pipe-from command. – goh Dec 07 '10 at 10:14
  • Oh well. In that case I don't have a clue. If you can, try inspecting the training and test data to make sure the documents are being represented in the same way. – Stompchicken Dec 07 '10 at 10:23

I had the same difficulty with MALLET. I later found that the problem is that new documents must be read in through the same Pipe that was used to read in the training documents.

Here is the sample to read in training documents:

ImportExample importerTrain = new ImportExample(); // an example class in MALLET for importing docs
InstanceList training = importerTrain.readDirectory(new File(trainingDir));
training.save(new File(outputFile));

And when reading in documents for topic inference:

InstanceList training = InstanceList.load(new File(outputFile));
Pipe pipe = training.getPipe();
ImportExample importer = new ImportExample();
importer.pipe = pipe; //use the same pipe
InstanceList testing = importer.readDirectory(new File(testDir));

I got my clue from a question posted in their mailing-list archive: http://thread.gmane.org/gmane.comp.ai.mallet.devel/829
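A self-contained sketch of why sharing the Pipe matters. This uses plain Java rather than MALLET classes; `buildAlphabet` is a hypothetical stand-in for the word-to-feature-index alphabet a Pipe accumulates. Two independently built alphabets assign different ids to the same word, so the trained model's topic-word counts no longer line up with the test document's features.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PipeMismatch {
    // Hypothetical stand-in for a MALLET alphabet: assigns each new word
    // the next integer id, in order of first appearance.
    static Map<String, Integer> buildAlphabet(String... docs) {
        Map<String, Integer> alphabet = new LinkedHashMap<>();
        for (String doc : docs)
            for (String word : doc.split("\\s+"))
                alphabet.putIfAbsent(word, alphabet.size());
        return alphabet;
    }

    public static void main(String[] args) {
        // Alphabet accumulated while importing the training corpus...
        Map<String, Integer> trainAlphabet =
            buildAlphabet("flu fever cough", "market stock price");
        // ...versus a second, independent alphabet built for the test data.
        Map<String, Integer> testAlphabet = buildAlphabet("fever flu");

        // The same word gets a different feature id in each alphabet.
        System.out.println(trainAlphabet.get("fever")); // 1 in training
        System.out.println(testAlphabet.get("fever"));  // 0 in testing
    }
}
```

Reusing the training Pipe (as in the snippet above) guarantees the test documents are mapped through the training alphabet, and words the model has never seen are dropped rather than silently remapped.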

Arrika

Disclosure: I'm familiar with the techniques and the math generally used for topic inference, but I have minimal exposure to MALLET.
I hope these semi-educated guesses lead you to a solution. No warranty ;-)

I'm assuming you are using the mallet command hlda for training the model.
A few things that may have gone wrong:

  • Ensure you used the --keep-sequence option during the import phase of the process. By default mallet saves the inputs as plain bags of words, losing the order in which the words were originally found. This may be OK for basic classification tasks but not for topic modeling.
  • Remember that the Gibbs sampling used by mallet is a stochastic process; expect variations, in particular with small samples. During tests you may want to specify the same random seed for each run to ensure repeatable results.
  • What is the size of your training data? 20 topics seems a lot for initial tests which are typically based on small, manually crafted and/or quickly assembled training and testing sets.
  • Remember that topic inference is based on sequences of words, not isolated keywords (your description of the manually crafted test document mentions "keywords" rather than, say, "expressions" or "phrases").
mjv
  • Hi @mjv, I used the command "train-topics" actually (I assume it's using ParallelTopicModel). 1. Yes, I used the --keep-sequence option. 2. My training data consists of 8000+ documents. I believe that is an adequate dataset? – goh Dec 07 '10 at 09:42
  • 3. Isn't LDA based on bag-of-words? But even if my "manually crafted test document" does not produce the results, shouldn't the "already sampled document" produce somewhat similar results during inference to its original topic distribution? – goh Dec 07 '10 at 09:45

Here's how I infer topic distributions for new documents using MALLET. I thought I would post this since I have been looking for how to do it, and while there are a lot of answers, none of them are comprehensive. This includes the training steps as well, so you get an idea of how the different files connect to each other.

Create your training data:

$BIN_DIR/mallet import-file --input $DIRECTORY/data.input --output $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'

where data.input is a file in which each line contains your file ID, a label, and a sequence of tokens or token IDs. Then train your model on this data with the parameters you like. For example:

$BIN_DIR/mallet train-topics --input $DIRECTORY/data.mallet \
      --num-topics $TOPICS --output-state $DIRECTORY/topic-state.gz \
      --output-doc-topics $DIRECTORY/doc-topics.gz \
      --output-topic-keys $DIRECTORY/topic-words.gz --num-top-words 500 \
      --num-iterations 1000

Later, you can create an inferencer using your trained model and training data:

$BIN_DIR/mallet train-topics --input $DIRECTORY/data.mallet --num-topics $TOPICS --input-state $DIRECTORY/topic-state.gz --no-inference --inferencer-filename $DIRECTORY/inferencer-model

Now, create a file for the new documents, using the pipe from the training data:

$BIN_DIR/mallet import-file --input $DIRECTORY/new_data.input --output $DIRECTORY/new_data.mallet --use-pipe-from $DIRECTORY/data.mallet --keep-sequence --token-regex '\w+'

Finally, infer topics for the new documents:

$BIN_DIR/mallet infer-topics --inferencer $DIRECTORY/inferencer-model --input $DIRECTORY/new_data.mallet --output-doc-topics $DIRECTORY/new_data_doc_topics --num-iterations 1000