Bayesian classification for semi-structured data in Java

Question

I would like to train and use a Bayesian classifier for the following situation:

Semi-structured data - basically an XML schema
Information is contained in multiple plain text fields
Some fields / parts of the schema may be repeated an arbitrary number of times

The classification itself is fairly simple - basically I need a probability of the document being in a specific category.
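
As a rough illustration of the kind of classification meant here, the sketch below is a minimal multinomial naive Bayes classifier with add-one smoothing. It assumes a document has already been parsed into a map from field name to the list of text values for that field, so a repeated field simply contributes more values. The class `FieldNaiveBayes` and all of its methods are hypothetical names made up for this example; they are not taken from any existing library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative only: multinomial naive Bayes over "field:token" features. */
class FieldNaiveBayes {
    private final Map<String, Integer> docsPerCategory = new HashMap<>();
    private final Map<String, Map<String, Integer>> featureCounts = new HashMap<>();
    private final Map<String, Integer> totalFeatures = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Flattens field name -> text values into features such as "body:cheap". */
    static List<String> features(Map<String, List<String>> doc) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : doc.entrySet()) {
            for (String text : field.getValue()) {
                // Crude tokenizer: split on anything that is not a letter or digit.
                for (String token : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                    if (!token.isEmpty()) {
                        out.add(field.getKey() + ":" + token);
                    }
                }
            }
        }
        return out;
    }

    /** Adds one labelled document to the counts. */
    void train(String category, Map<String, List<String>> doc) {
        totalDocs++;
        docsPerCategory.merge(category, 1, Integer::sum);
        Map<String, Integer> counts = featureCounts.computeIfAbsent(category, c -> new HashMap<>());
        for (String f : features(doc)) {
            counts.merge(f, 1, Integer::sum);
            totalFeatures.merge(category, 1, Integer::sum);
            vocabulary.add(f);
        }
    }

    /** Returns P(category | doc) for each trained category, normalised to sum to 1. */
    Map<String, Double> probabilities(Map<String, List<String>> doc) {
        List<String> feats = features(doc);
        Map<String, Double> logScores = new HashMap<>();
        double max = Double.NEGATIVE_INFINITY;
        for (String category : docsPerCategory.keySet()) {
            // Log prior plus the sum of smoothed log likelihoods.
            double score = Math.log(docsPerCategory.get(category) / (double) totalDocs);
            Map<String, Integer> counts = featureCounts.get(category);
            double denom = totalFeatures.getOrDefault(category, 0) + vocabulary.size();
            for (String f : feats) {
                if (!vocabulary.contains(f)) {
                    continue; // features never seen in training carry no evidence
                }
                score += Math.log((counts.getOrDefault(f, 0) + 1) / denom);
            }
            logScores.put(category, score);
            max = Math.max(max, score);
        }
        // Subtract the maximum before exponentiating to avoid underflow.
        Map<String, Double> result = new HashMap<>();
        double sum = 0.0;
        for (Map.Entry<String, Double> e : logScores.entrySet()) {
            double p = Math.exp(e.getValue() - max);
            result.put(e.getKey(), p);
            sum += p;
        }
        final double total = sum;
        result.replaceAll((c, p) -> p / total);
        return result;
    }

    public static void main(String[] args) {
        FieldNaiveBayes nb = new FieldNaiveBayes();
        nb.train("spam", Collections.singletonMap("body", Arrays.asList("buy cheap pills", "cheap offer")));
        nb.train("ham", Collections.singletonMap("body", Arrays.asList("meeting agenda for monday")));
        System.out.println(nb.probabilities(Collections.singletonMap("body", Arrays.asList("cheap pills"))));
    }
}
```

Prefixing each token with its field name is only one possible way to handle the schema; it keeps the same word in different fields as separate evidence, at the cost of a larger vocabulary.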

Design constraints:

Solution must be either open source or available under another royalty-free license
It must be possible to save / load classifiers for future use
It must be possible to embed this library in a larger Java-based application (i.e. it must work as a Java/JVM library)
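
To make the save / load constraint concrete, here is a minimal sketch using plain Java serialization from the standard library. `TokenCountModel` and its methods are hypothetical names for illustration, not the API of any existing classifier library; a real library may well use its own model format instead.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;

/** Illustrative stand-in for a trained model: token counts per category. */
class TokenCountModel implements Serializable {
    private static final long serialVersionUID = 1L;

    final HashMap<String, HashMap<String, Integer>> counts = new HashMap<>();

    /** Records one occurrence of a token in a category. */
    void add(String category, String token) {
        counts.computeIfAbsent(category, c -> new HashMap<>()).merge(token, 1, Integer::sum);
    }

    /** Writes the whole model to a file. */
    void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    /** Reads a model previously written by save(). Only use on trusted files. */
    static TokenCountModel load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (TokenCountModel) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        TokenCountModel model = new TokenCountModel();
        model.add("spam", "cheap");
        File file = File.createTempFile("model", ".ser");
        file.deleteOnExit();
        model.save(file);
        TokenCountModel restored = TokenCountModel.load(file);
        System.out.println(restored.counts.equals(model.counts));
    }
}
```

In other words, the requirement is that a classifier trained in one run of the application can be written to disk and restored in a later run without retraining.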

Are there any libraries / tools that would fit these requirements?

I wish I could give a full answer but my search turns up BJN and bayesian networks. Does that help? — AncientSwordRage, Sep 11 '12 at 08:48
I've seen that many such libraries exist - but I was hoping that someone with experience using a few of them could say if any of them meet the requirements above (otherwise I and anyone else with a similar problem are going to have to waste a day or two testing / evaluating them....) — mikera, Sep 11 '12 at 09:05

score 1 · Accepted Answer · answered Sep 12 '12 at 20:28

I'm not sure whether you already have your classifier ready, but I've used Apache's UIMA framework for a couple of drawer projects. UIMA is "just" a framework, but it does come with some logic. Some heavy-duty googling turned up an example Bayesian classifier using UIMA.

UIMA has mechanisms for modifying your configuration at runtime, but I'm a bit unclear as to what you mean by "save and load classifiers". Does this mean that you have an array of binary classifiers you would like to load (and unload) at runtime, or do you have different models that you would like to load/unload?

The answers to your other questions are:

yes, UIMA is open source, released under ASLv2
yes, you can embed UIMA as a library in your application.
