I'm using the Java API of Apache Jena to store and retrieve documents and the words within them. For this I decided to set up the following data structure:
_dataset = TDBFactory.createDataset("./database");
_dataset.begin(ReadWrite.WRITE);
Model model = _dataset.getDefaultModel();
// The document and its label
Resource document = model.createResource("http://name.space/Source/DocumentA");
document.addProperty(RDF.value, "Document A");
// The word and its label
Resource word = model.createResource("http://name.space/Word/aword");
word.addProperty(RDF.value, "aword");
// Blank node linking the word to its occurrence count in this document
Resource resource = model.createResource();
resource.addProperty(RDF.value, word);
resource.addProperty(RSS.items, "5");
// Attach the blank node to the document
document.addProperty(RDF.type, resource);
_dataset.commit();
_dataset.end();
The code example above represents a document ("Document A") in which the word "aword" occurs five (5) times. The occurrences of a word in a document are counted and stored as a property. Because a word can also occur in other documents, the occurrence count for a specific word in a specific document is attached to a blank node that links the two. (I'm not entirely sure whether this structure makes sense, as I'm fairly new to this way of storing information, so please feel free to suggest better solutions!)
My main question is: How can I get a list of all distinct words and the sum of their occurrences over all documents?
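One possible direction, as a minimal sketch rather than a confirmed solution, is a SPARQL aggregate query (SUM with GROUP BY) run through Jena's ARQ API. The sketch makes several assumptions that are not part of the code above: Jena 3 or later package names (org.apache.jena), an in-memory model instead of the TDB dataset, a second made-up document ("DocumentB") and word ("other"), and the hypothetical names WordTotals, addOccurrence and totals. Because the counts are stored as plain strings, the query casts them with xsd:integer before summing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RSS;

public class WordTotals {

    // Hypothetical helper: adds one "word occurs N times in document" entry,
    // using the same triple layout as the question.
    static void addOccurrence(Model model, String docUri, String wordUri, String count) {
        Resource document = model.createResource(docUri);
        Resource word = model.createResource(wordUri);
        Resource entry = model.createResource();   // blank node linking word and count
        entry.addProperty(RDF.value, word);
        entry.addProperty(RSS.items, count);       // count kept as a plain string literal
        document.addProperty(RDF.type, entry);
    }

    // Hypothetical helper: returns word URI -> summed occurrence count over all documents.
    static Map<String, Long> totals(Model model) {
        // Property URIs are taken from the Jena vocabulary classes instead of being hard-coded.
        String sparql =
            "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>\n" +
            "SELECT ?word (SUM(xsd:integer(?count)) AS ?total)\n" +
            "WHERE {\n" +
            "  ?doc <" + RDF.type.getURI() + "> ?entry .\n" +
            "  ?entry <" + RDF.value.getURI() + "> ?word ;\n" +
            "         <" + RSS.items.getURI() + "> ?count .\n" +
            "}\n" +
            "GROUP BY ?word\n" +
            "ORDER BY ?word";

        Map<String, Long> result = new LinkedHashMap<>();
        try (QueryExecution qexec = QueryExecutionFactory.create(sparql, model)) {
            ResultSet rows = qexec.execSelect();
            while (rows.hasNext()) {
                QuerySolution row = rows.next();
                result.put(row.getResource("word").getURI(),
                           row.getLiteral("total").getLong());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // In-memory model for illustration only; the question uses a TDB dataset.
        Model model = ModelFactory.createDefaultModel();
        addOccurrence(model, "http://name.space/Source/DocumentA", "http://name.space/Word/aword", "5");
        addOccurrence(model, "http://name.space/Source/DocumentB", "http://name.space/Word/aword", "3");
        addOccurrence(model, "http://name.space/Source/DocumentB", "http://name.space/Word/other", "2");

        // Print each distinct word with its total over all documents.
        totals(model).forEach((word, total) -> System.out.println(word + " " + total));
    }
}
```

With the TDB dataset from the question, the same query would presumably be run inside a read transaction against _dataset.getDefaultModel(). Storing the count as a typed integer literal instead of a string would make the xsd:integer cast unnecessary.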