
I have implemented a document classification tool using Mallet that assigns each page of a document to one of several categories. I tried Weka too, but Mallet performed better than Weka for this task. My approach is as follows:

  1. Train the classifier on pages of documents with known categories.
  2. Test a few sample documents to check whether Mallet identifies the pages of a given category. Here Mallet matches pages from the test set against the known categories.
  3. If the test results are satisfactory, run the classifier and Mallet file on a large document repository.

This part is already implemented with a good success rate.

Text documents that I have not trained on and that differ from the known categories should be returned as NO Match. Instead, Mallet tries to find a match from the training set even for documents it has never seen.

For example, I have 4 pages in a document. Page 1 belongs to class A and page 3 belongs to class B. Pages 2 and 4 do not belong to any class. How can I mark pages 2 and 4 as 'NO Match' through Mallet?

Please help me achieve this. Let me know if I am doing anything wrong, or if there is another tool that can give me the desired output.

InfoUser

1 Answer


Two quick thoughts:

  1. You can set a threshold on the confidence value. For example, if Mallet says that Page 1 belongs to Class A with 90% confidence, accept it. If it says that Page 2 belongs to Class C with 60% confidence, and that is the best score, you may want to reject that suggestion. You can get the classification scores through the function getClassificationScores (documentation: http://mallet.cs.umass.edu/api/cc/mallet/classify/MaxEnt.html#getClassificationScores(cc.mallet.types.Instance, double[])).

  2. You could use scikit-learn in Python. I have heard that if it doesn't know which class your page belongs to, it will report NA.
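
The threshold idea from point 1 can be sketched independently of any particular library. The following is only a minimal illustration in plain Python: the function name classify_with_rejection, the NO_MATCH label, and the per-page scores are all hypothetical, and in practice the scores would come from your classifier (for example, the array filled in by getClassificationScores).

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical label returned when no class is confident enough.
NO_MATCH = "NO_MATCH"


def classify_with_rejection(
    labels: Sequence[str],
    scores: Sequence[float],
    threshold: float = 0.6,
) -> Tuple[str, float]:
    """Return (label, score) for the best class, or (NO_MATCH, score)
    when the best score falls below the threshold."""
    if not labels or len(labels) != len(scores):
        raise ValueError("labels and scores must be non-empty and equal in length")

    # Index of the highest-scoring class.
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    best_score = scores[best_index]

    # Reject the prediction if the classifier is not confident enough.
    if best_score < threshold:
        return NO_MATCH, best_score
    return labels[best_index], best_score


# Made-up scores for a 4-page document, one score per class (A, B, C).
labels: List[str] = ["A", "B", "C"]
pages: Dict[int, List[float]] = {
    1: [0.90, 0.06, 0.04],
    2: [0.40, 0.35, 0.25],
    3: [0.10, 0.85, 0.05],
    4: [0.30, 0.30, 0.40],
}

results = {
    page: classify_with_rejection(labels, page_scores)[0]
    for page, page_scores in pages.items()
}

for page, label in results.items():
    print(f"Page {page}: {label}")
```

With these invented scores, pages 1 and 3 are accepted as A and B, and pages 2 and 4 are marked NO_MATCH. The right threshold depends on your data, so it is worth tuning on a held-out set that includes pages from none of the trained categories.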

pnv
  • Thank you for your suggestion. I am already using the first approach you mentioned: I have set the threshold at 60% and discard anything below that confidence. I still need to go through the scikit-learn tools and algorithms. – InfoUser Feb 06 '15 at 14:05