
I would like to train and use a Bayesian classifier for the following situation:

  • Semi-structured data - basically an XML schema
  • Information is contained in multiple plain text fields
  • Some fields / parts of the schema may be repeated an arbitrary number of times

The classification itself is fairly simple - basically I need a probability of the document being in a specific category.

Design constraints:

  • Solution must be either open source or available under another royalty-free license
  • It must be possible to save / load classifiers for future use
  • It must be possible to embed this library in a larger Java-based application (i.e. it must work as a Java/JVM library)

Are there any libraries / tools that would fit these requirements?

mikera
  • I wish I could give a full answer but my search turns up BJN and bayesian networks. Does that help? – AncientSwordRage Sep 11 '12 at 08:48
  • Have you looked into Mahout, Weka, GATE NLP? – Sap Sep 11 '12 at 09:00
  • I've seen that many such libraries exist - but I was hoping that someone with experience using a few of them could say if any of them meet the requirements above (otherwise I and anyone else with a similar problem are going to have to waste a day or two testing / evaluating them....) – mikera Sep 11 '12 at 09:05

1 Answer


I'm not sure whether you already have your classifier ready, but I've used Apache's UIMA framework for a couple of drawer projects. UIMA is "just" a framework, but it does come with some logic. Some heavy-duty googling came up with an example Bayesian classifier using UIMA.

It has mechanisms for modifying your configurations at runtime, but I'm also a bit unclear as to what you mean by "save and load classifiers". Does this mean that you have an array of binary classifiers you would like to load (and unload) at runtime, or do you have different models that you would like to load/unload?

The answers to your other questions are:

  • yes, UIMA is open source, released under ASLv2
  • yes, you can embed UIMA as a library in your application.
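
If "save / load" simply means persisting a trained model, a minimal sketch of the idea using only the JDK might look like the following. This is not UIMA's API: the class `TinyNaiveBayes` and its methods are hypothetical names made up for illustration, and the sketch assumes the XML has already been parsed and every text field (including repeated ones) flattened into a list of strings. It uses multinomial naive Bayes with add-one smoothing and plain Java serialization.

```java
import java.io.*;
import java.util.*;

/** Hypothetical minimal naive Bayes classifier; not part of UIMA or any library. */
class TinyNaiveBayes implements Serializable {
    private static final long serialVersionUID = 1L;
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> totalTokens = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Lower-cases every field and splits it on non-word characters. */
    static List<String> tokenize(List<String> fields) {
        List<String> tokens = new ArrayList<>();
        for (String field : fields) {
            for (String t : field.toLowerCase().split("\\W+")) {
                if (!t.isEmpty()) tokens.add(t);
            }
        }
        return tokens;
    }

    /** Adds one labelled document, given as its flattened text fields. */
    void train(String category, List<String> fields) {
        totalDocs++;
        docCounts.merge(category, 1, Integer::sum);
        Map<String, Integer> counts =
                tokenCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String t : tokenize(fields)) {
            counts.merge(t, 1, Integer::sum);
            totalTokens.merge(category, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    /** Returns the estimated P(category | document), or 0 for an unknown category. */
    double probability(String category, List<String> fields) {
        List<String> tokens = tokenize(fields);
        Map<String, Double> logScores = new HashMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (String c : docCounts.keySet()) {
            double score = Math.log(docCounts.get(c) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(c);
            // Add-one smoothing; the extra +1 reserves a slot for unseen tokens.
            double denom = totalTokens.getOrDefault(c, 0) + vocabulary.size() + 1;
            for (String t : tokens) {
                score += Math.log((counts.getOrDefault(t, 0) + 1) / denom);
            }
            logScores.put(c, score);
            max = Math.max(max, score);
        }
        if (!logScores.containsKey(category)) return 0.0;
        // Normalise in log space to avoid underflow on long documents.
        double sum = 0.0;
        for (double s : logScores.values()) sum += Math.exp(s - max);
        return Math.exp(logScores.get(category) - max) / sum;
    }

    /** Writes the trained model to a file using Java serialization. */
    void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    /** Reads a model previously written by save(). */
    static TinyNaiveBayes load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (TinyNaiveBayes) in.readObject();
        }
    }
}
```

Typical use would be to call `train(category, fields)` once per labelled document, `save(file)` when training is done, and later `TinyNaiveBayes.load(file).probability(category, fields)` inside the larger application. Treating all fields as one bag of words is a simplification; per-field features would need a richer model.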
Steen