I would like to train and use a Bayesian classifier for the following situation:
- Semi-structured data - basically an XML schema
- Information is contained in multiple plain text fields
- Some fields / parts of the schema may be repeated an arbitrary number of times
The classification itself is fairly simple: I just need the probability that a document belongs to a specific category.
Design constraints:
- Solution must be either open source or available under another royalty-free license
- It must be possible to save / load classifiers for future use
- It must be possible to embed this library in a larger Java-based application (i.e. it must work as a Java/JVM library)
Are there any libraries / tools that would fit these requirements?
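
To make the requirements concrete, here is a rough sketch of the kind of behaviour I have in mind. It is not taken from any real library: the class name FieldNaiveBayes and all of its methods are hypothetical, and I am assuming a plain two-class multinomial naive Bayes with add-one smoothing, where each token is prefixed with the name of the field it came from and repeated fields are simply passed as a list of values. Saving and loading uses standard Java serialization.

```java
import java.io.*;
import java.util.*;

/** Hypothetical two-class naive Bayes over the text fields of one document. */
public class FieldNaiveBayes implements Serializable {
    private static final long serialVersionUID = 1L;

    // Index 0 = not in the category, index 1 = in the category.
    private final int[] docCounts = new int[2];
    private final long[] tokenTotals = new long[2];
    private final List<Map<String, Integer>> tokenCounts = new ArrayList<>();
    private final Set<String> vocabulary = new HashSet<>();

    public FieldNaiveBayes() {
        tokenCounts.add(new HashMap<>());
        tokenCounts.add(new HashMap<>());
    }

    // Flatten a document into "field:word" tokens; repeated fields just add more tokens.
    private static List<String> tokens(Map<String, List<String>> fields) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<String>> field : fields.entrySet()) {
            for (String text : field.getValue()) {
                for (String word : text.toLowerCase().split("\\W+")) {
                    if (!word.isEmpty()) result.add(field.getKey() + ":" + word);
                }
            }
        }
        return result;
    }

    /** fields maps a field name to all of its (possibly repeated) text values. */
    public void train(Map<String, List<String>> fields, boolean inCategory) {
        int c = inCategory ? 1 : 0;
        docCounts[c]++;
        for (String token : tokens(fields)) {
            tokenCounts.get(c).merge(token, 1, Integer::sum);
            tokenTotals[c]++;
            vocabulary.add(token);
        }
    }

    /** Returns the estimated probability that the document is in the category. */
    public double probability(Map<String, List<String>> fields) {
        List<String> docTokens = tokens(fields);
        int totalDocs = docCounts[0] + docCounts[1];
        double[] logScore = new double[2];
        for (int c = 0; c < 2; c++) {
            // Add-one smoothed class prior.
            logScore[c] = Math.log((docCounts[c] + 1.0) / (totalDocs + 2.0));
            // Add-one smoothed token likelihoods; the extra +1 leaves room for unseen tokens.
            double denominator = tokenTotals[c] + vocabulary.size() + 1.0;
            for (String token : docTokens) {
                int count = tokenCounts.get(c).getOrDefault(token, 0);
                logScore[c] += Math.log((count + 1.0) / denominator);
            }
        }
        // Normalise the two log scores into a probability for class 1.
        return 1.0 / (1.0 + Math.exp(logScore[0] - logScore[1]));
    }

    public void save(File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(this);
        }
    }

    public static FieldNaiveBayes load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (FieldNaiveBayes) in.readObject();
        }
    }
}
```

In other words: train on parsed documents, call something like probability(...) on a new document, and be able to save(...) / load(...) the trained model between runs. Extracting the field values from the XML is something I can handle myself.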