I have an RDF file that is about 7 MB and contains ~80k statements.

When the application starts, I run the following code, which retrieves a list of items I need to show to the user:

    NodeIterator iterator = technologyModel.listObjectsOfProperty(subject);
    while (iterator.hasNext()) {
        RDFNode node = iterator.nextNode();
        myCollection.add(node.asLiteral().getString().trim());
    }

Note: This code works fine and returns about 3k results. It is also the first time the "technologyModel" is accessed.

Obviously, I have to load the dataset/model before doing that, and this is where the problem is.

Case (1): When I load the dataset/model from an RDF file like this:

    InputStream in = FileManager.get().open(ParamsHelper.sourceRDF);
    technologyModel.read(in, "RDF/XML-ABBREV");

the technologyModel seems to load instantly, and the first code snippet above runs in less than a second.

Case (2): However, when I load the model from a TDB database (previously populated from the same RDF file used in the first case) with this code:

    dataset = TDBFactory.createDataset(ParamsHelper.tdbBaseDir);
    dataset.begin(ReadWrite.READ) ;
    technologyModel = dataset.getNamedModel("http://a.example.biz/technology");
    dataset.end();

the technologyModel doesn't seem to load instantly. The first code snippet still returns the expected results, but the first call takes about 30 seconds.

If I call that same code a second time, or insert another operation such as technologyModel.listSubjects() before calling it for the first time, it runs immediately, as expected.

It seems to me that in the second case the model is only really loaded when the first operation is performed on it. Does that make sense?

I don't want to keep my data in an RDF file; I would rather have a TDB database storing the triples. That's why the second option suits me better.

Can anyone help me with this? I hope I have explained the problem clearly.

Thanks in advance.

1 Answer

There are two effects here:

The first is that TDBFactory.createDataset doesn't load any data; it only connects to the database. Data is loaded into memory (cached) as it is used, so when you call listObjectsOfProperty for the first time, all caches are cold and the database may well be slow. At this point it is quite sensitive to the hardware you are running on.

The second is that Model API calls can have access patterns that are database-unfriendly. It is better to use SPARQL on the dataset.
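As a rough illustration of "SPARQL on the dataset", here is a minimal sketch. It assumes a recent Jena release with the `org.apache.jena.*` package names (2015-era releases used `com.hp.hpl.jena.*`), and it uses an in-memory TDB dataset with one made-up triple so that it runs on its own; in the question's setup the dataset would come from `TDBFactory.createDataset(ParamsHelper.tdbBaseDir)`. The class and method names are hypothetical, and `DISTINCT` is a choice made here, not something the question specifies.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.jena.query.Dataset;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ReadWrite;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.tdb.TDBFactory;
import org.apache.jena.vocabulary.DCTerms;

public class SubjectQuerySketch {
    // Graph name taken from the question.
    static final String GRAPH = "http://a.example.biz/technology";

    // Builds a SELECT for the objects of one property inside one named graph.
    static String buildQuery(String graphUri, String propertyUri) {
        return "SELECT DISTINCT ?o WHERE { GRAPH <" + graphUri + "> "
             + "{ ?s <" + propertyUri + "> ?o } }";
    }

    // Runs the query inside a read transaction and copies the literal values
    // out before the transaction ends.
    static List<String> listObjects(Dataset dataset, String graphUri, String propertyUri) {
        List<String> values = new ArrayList<>();
        dataset.begin(ReadWrite.READ);
        try (QueryExecution qe =
                 QueryExecutionFactory.create(buildQuery(graphUri, propertyUri), dataset)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                RDFNode node = rs.next().get("o");
                if (node.isLiteral()) {
                    values.add(node.asLiteral().getString().trim());
                }
            }
        } finally {
            dataset.end();
        }
        return values;
    }

    // Demo data only (assumption): an in-memory TDB dataset with one triple.
    static Dataset sampleDataset() {
        Dataset dataset = TDBFactory.createDataset();
        dataset.begin(ReadWrite.WRITE);
        try {
            Model model = dataset.getNamedModel(GRAPH);
            model.createResource("http://a.example.biz/technology/item1")
                 .addProperty(DCTerms.subject, " Semantic Web ");
            dataset.commit();
        } finally {
            dataset.end();
        }
        return dataset;
    }

    public static void main(String[] args) {
        System.out.println(listObjects(sampleDataset(), GRAPH, DCTerms.subject.getURI()));
    }
}
```

Note that this only changes the access pattern. On a cold database the first query can still be slow, as described above.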

By the way: listObjectsOfProperty does not take a subject. It takes a property, and it can touch a lot of the database. If myCollection is a set, then you may be adding a lot more than 3K items.
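To make the difference concrete, here is a small sketch on an in-memory model comparing `listObjectsOfProperty(Property)` with the overload that also takes a subject resource. The sample resources and literals are made up for illustration, the class name is hypothetical, and the `org.apache.jena.*` package names assume a recent Jena release.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.NodeIterator;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.DCTerms;

public class ListObjectsSketch {
    static final String A = "http://a.example.biz/technology/a";
    static final String B = "http://a.example.biz/technology/b";

    // Builds a tiny model: two dcterms:subject values on A, one on B.
    static Model sampleModel() {
        Model model = ModelFactory.createDefaultModel();
        Resource a = model.createResource(A);
        Resource b = model.createResource(B);
        a.addProperty(DCTerms.subject, "RDF");
        a.addProperty(DCTerms.subject, "SPARQL");
        b.addProperty(DCTerms.subject, "TDB");
        return model;
    }

    // Drains an iterator and returns how many nodes it produced.
    static int count(NodeIterator it) {
        int n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        Model model = sampleModel();
        // Property only: objects of dcterms:subject across every subject in the model.
        System.out.println(count(model.listObjectsOfProperty(DCTerms.subject)));
        // Subject and property: objects of dcterms:subject for resource A only.
        System.out.println(count(model.listObjectsOfProperty(model.getResource(A), DCTerms.subject)));
    }
}
```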

AndyS
  • Hey AndyS, thanks for the reply! The first effect makes perfect sense and I think that's what is happening. As for the second effect, I tried replacing the Model API call with a SPARQL query, but it still took about 30 seconds to retrieve the (same) results. Sorry for the confusion: subject here means the [property subject of DCTERMS](http://dublincore.org/documents/dcmi-terms/#terms-subject). And 3K is the number of results of the query. The collection does indeed have fewer items. – Jonas Arêas Feb 09 '15 at 21:01
  • AndyS, with that said, is there any way to load my model immediately, just as `technologyModel.read(in, "RDF/XML-ABBREV");` does in the other case? – Jonas Arêas Feb 09 '15 at 21:05
  • technologyModel.read will work on a TDB-backed model if done inside a write transaction. – AndyS Feb 10 '15 at 16:41
  • But this method will load RDF from an InputStream or the url parameter into the database, rather than load the database content, right? – Jonas Arêas Feb 10 '15 at 19:07
  • Model.read reads RDF triples into a Model. If the model is backed by TDB (via dataset.getNamedModel), the triples go into the named graph in the database. Try it. Print the database in TriG to see the structure. – AndyS Feb 11 '15 at 20:14