I'm using the Java API of Apache Jena to store and retrieve documents and the words within them. For this I decided to set up the following data structure:

_dataset = TDBFactory.createDataset("./database");
_dataset.begin(ReadWrite.WRITE);

Model model = _dataset.getDefaultModel();
Resource document = model.createResource("http://name.space/Source/DocumentA");
document.addProperty(RDF.value, "Document A");

Resource word = model.createResource("http://name.space/Word/aword");
word.addProperty(RDF.value, "aword");

Resource resource = model.createResource();
resource.addProperty(RDF.value, word);
resource.addProperty(RSS.items, "5");

document.addProperty(RDF.type, resource);

_dataset.commit();
_dataset.end();

The code example above represents a document ("Document A") in which the word "aword" occurs five (5) times. The occurrences of a word in a document are counted and stored as a property. Because a word can also occur in other documents, the occurrence count for a specific word in a specific document is tied to both through a blank node. (I'm not entirely sure whether this structure makes sense, as I'm fairly new to this way of storing information, so please feel free to suggest better solutions!)

My main question is: how can I get a list of all distinct words and the sum of their occurrences over all documents?

manidu
1 Answer

Your data model is a bit unconventional, in my opinion. With your code, you'll end up with data that looks like this (in Turtle notation), and which uses rdf:type and rdf:value in unconventional ways:

:doc rdf:value "document a" ;
     rdf:type :resource .
:resource rdf:value :word ;
          :items 5 .
:word rdf:value "aword" .

It's unusual, because usually you wouldn't have such complex information on the type attribute of a resource. From the SPARQL standpoint though, rdf:type and rdf:value are properties just like any other, and you can still retrieve the information you're looking for with a simple query. It would look more or less like this (though you'll need to define some prefixes, etc.):

select ?word (sum(?n) as ?nn) where {
  ?document rdf:type ?type .
  ?type rdf:value/rdf:value ?word ;
        :items ?n .
}
group by ?word

That query will produce one result per word, and each result carries the sum of all the :items values associated with that word. There are lots of questions on Stack Overflow with examples of running SPARQL queries with Jena. For example, the first one that I found with Google: Query Jena TDB store.
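
To make the grouping concrete, here is a minimal sketch in plain Java of what the `group by ?word` plus `sum(?n)` combination computes. It deliberately does not use Jena: the `Occurrence` and `WordTotals` classes, the `totalsByWord` method, and the sample counts for a second document are all hypothetical and only stand in for the blank nodes in the question's model.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for one blank node in the question's model:
// it ties a word to a document together with an occurrence count.
final class Occurrence {
    final String document;
    final String word;
    final long count;

    Occurrence(String document, String word, long count) {
        this.document = document;
        this.word = word;
        this.count = count;
    }
}

final class WordTotals {
    // Mirrors the idea of "select ?word (sum(?n) as ?nn) ... group by ?word":
    // one entry per distinct word, holding the sum of its counts over all documents.
    static Map<String, Long> totalsByWord(List<Occurrence> occurrences) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (Occurrence o : occurrences) {
            // Counts are numeric here, so they can be added directly.
            totals.merge(o.word, o.count, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Sample data is made up for illustration only.
        List<Occurrence> data = Arrays.asList(
                new Occurrence("DocumentA", "aword", 5),
                new Occurrence("DocumentB", "aword", 2),
                new Occurrence("DocumentB", "other", 1));
        System.out.println(totalsByWord(data)); // prints {aword=7, other=1}
    }
}
```

The sketch only works because the counts are numbers rather than text, which is also what the SPARQL `sum` aggregate needs from the :items values.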

Joshua Taylor
  • Never mind the chosen properties; I just took what sounded good for what I wanted it to mean, just to try it out. Regarding the suggested SPARQL query: I actually came up with the same one, but it didn't work, so I thought I must be doing something wrong. Meanwhile, I found the reason: `resource.addProperty(RSS.items, "5")` has to be changed to `resource.addProperty(RSS.items, "5", XSDDatatype.XSDint)`; otherwise the `SUM()` in the query apparently treats the values as strings and returns nothing. – manidu Jan 21 '15 at 10:07
  • @manidu Yeah, you can't add strings. Rather than addProperty(), though, it might be more convenient to just use [addLiteral](https://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Resource.html#addLiteral(com.hp.hpl.jena.rdf.model.Property, long)), which would just be `resource.addLiteral(RSS.items, 5)`. No strings at all, just the actual number. – Joshua Taylor Jan 21 '15 at 13:34