I am evaluating OpenNLP for use as a document categorizer. I have a sanitized training corpus of roughly 4k files across about 150 categories. The documents share many words that are mostly irrelevant on their own but become relevant as n-grams, so I'm training with the following parameters:
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 20000);
params.put(TrainingParameters.CUTOFF_PARAM, 10);
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);

// word n-grams of length 3 through 10 instead of the default bag of words
DoccatFactory dcFactory = new DoccatFactory(new FeatureGenerator[] { new NGramFeatureGenerator(3, 10) });
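For completeness, here is roughly how those objects are fed into the trainer. The file name, language code, and stream setup are illustrative placeholders, but the shape is the same: my corpus is flattened to one labeled document per line (the category label, whitespace, then the document text), which is the form DocumentSampleStream expects.

import java.io.File;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// one training document per line: category label, whitespace, then the document text
InputStreamFactory isf = new MarkableFileInputStreamFactory(new File("train.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(isf, StandardCharsets.UTF_8);
ObjectStream<DocumentSample> sampleStream = new DocumentSampleStream(lineStream);

DoccatModel model = DocumentCategorizerME.train("en", sampleStream, params, dcFactory);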
Some of these categories apply to documents that are almost completely identical (think boilerplate legal documents, where perhaps only the names and addresses differ between instances), and those documents will be mostly identical to documents in the test set. However, no matter how I tweak these params, I can't break out of the "1 outcome patterns" result: when I run a test, every document in the test set is tagged with "Category A."
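The test run itself is essentially the following, continuing from the model trained above (testDocuments is a placeholder for the raw text of my test files, and the tokenizer choice is just what I happen to be using):

import opennlp.tools.tokenize.WhitespaceTokenizer;

DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

for (String docText : testDocuments) {
    String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(docText);
    double[] outcomes = categorizer.categorize(tokens);
    // prints the same single category for every document in the test set
    System.out.println(categorizer.getBestCategory(outcomes));
}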
I did manage to effect one minor change in the output by moving from the BagOfWordsFeatureGenerator to the NGramFeatureGenerator, and from maxent to Naive Bayes: before the change, every document in the test set was assigned "Category A"; after the change, every document was assigned "Category B." Other than that, I can't seem to move the dial at all.
I've tried fiddling with iterations, the cutoff, n-gram sizes, maxent instead of Naive Bayes, etc., but all to no avail.
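For the maxent runs, the only line that changes is the algorithm parameter (cutoff, iterations, and the feature generator stay the same); if I'm reading the API right, GISTrainer.MAXENT_VALUE is the constant for that:

// switch from Naive Bayes back to the maxent (GIS) trainer
params.put(AbstractTrainer.ALGORITHM_PARAM, GISTrainer.MAXENT_VALUE);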
Example code from tutorials I've found on the interweb has used much smaller training sets with fewer iterations, and is able to perform at least some rudimentary differentiation.
Usually in such a situation - a bewildering lack of expected behavior - the engineer has forgotten to flip some simple switch, or has some fatal gap in fundamental understanding. I am eminently capable of both those failures. Also, I have no Data Science training, although I have read a couple of O'Reilly books on the subject. So the problem could be procedural. Is the training set too small? Is the number of iterations off by an order of magnitude? Would a different algo be a better fit? I'm utterly surprised that no tweak has even slightly moved the dial away from the "1 outcome" outcome.
Any response appreciated.