
I have a rather small graph containing roughly 500k triples. I've also generated the stats.opt file, and I'm running my code on a rather fast computer (quad core, 16 GB RAM, SSD). But for the query I'm building with the help of the Op interface, it takes forever to iterate over the result set. The result set has about 15,000 rows and the iteration takes 4s, which is unacceptable for end users. Executing the query takes merely 90ms (I guess the real work is done by the cursor iteration?). Why is this so slow, and what can I do to speed up the result set iteration?

Here is the query:

SELECT  ?apartment ?price ?hasBalcony ?lat ?long ?label ?hasImage ?park ?supermarket ?rooms ?area ?street
WHERE
  { ?apartment dssd:hasBalcony ?hasBalcony .
    ?apartment wgs84:lat ?lat .
    ?apartment wgs84:long ?long .
    ?apartment rdfs:label ?label .
    ?apartment dssd:hasImage ?hasImage .
    ?apartment dssd:hasNearby ?hasNearbyPark .
    ?hasNearbyPark dssd:hasNearbyPark ?park .
    ?apartment dssd:hasNearby ?hasNearbySupermarket .
    ?hasNearbySupermarket dssd:hasNearbySupermarket ?supermarket .
    ?apartment dssd:price ?price .
    ?apartment dssd:rooms ?rooms .
    ?apartment dssd:area ?area .
    ?apartment vcard:hasAddress ?address .
    ?address vcard:streetAddress ?street
    FILTER ( ?hasBalcony = true )
    FILTER ( ?price <= 1000.0e0 )
    FILTER ( ?price >= 650.0e0 )
    FILTER ( ?rooms <= 4.0e0 )
    FILTER ( ?rooms >= 3.0e0 )
    FILTER ( ?area <= 100.0e0 )
    FILTER ( ?area >= 60.0e0 )
  }    

(Is there a better way to query those bnodes: ?hasNearbyPark, ?hasNearbySupermarket)

And the code to execute the query:

dataset.begin(ReadWrite.READ);
Model model = dataset.getNamedModel("http://example.com");
QueryExecution queryExecution = QueryExecutionFactory.create(buildQuery(), model);
ResultSet resultSet = queryExecution.execSelect();
while ( resultSet.hasNext() ) {
    QuerySolution solution = resultSet.next();
    // ...
}
queryExecution.close();
dataset.end();
  • You can use `?apartment dssd:hasNearby [ dssd:hasNearbyPark ?park ]` and `?apartment dssd:hasNearby [ dssd:hasNearbySupermarket ?supermarket ]` for the park and supermarket, and likewise for `?address`, which it doesn't seem like you use, except for getting the `?street`. You can save a bunch of typing with `;`, too. E.g., instead of `?apartment wgs84:lat ?lat .` ?apartment wgs84:long ?long .`, use `?apartment wgs84:lat ?lat ; wgs84:long ?long ; ...`. – Joshua Taylor Aug 29 '13 at 18:00
  • Well I'm building the query programmatically with OP so, i don't have any influence on this. Would this change execution time? – Daniel Gerber Aug 29 '13 at 18:03
  • Instead of `FILTER ( ?hasBalcony = true )`, you should probably just query with `?apartment dssd:hasBalcony true .`. Is there any performance difference if you start combining filter expressions, e.g., `FILTER ( ?price <= 1000.0e0 && ?price >= 650.0e0 )`. Also, for the rooms, if that's an integer value, perhaps you can use `{?apt rooms 3} UNION {?apt rooms 4}`. – Joshua Taylor Aug 29 '13 at 18:04
  • The blank node syntax is equivalent. Writing by hand, the blank node syntax might be more aesthetic, but it's not any more or less efficient. Same goes for shortening with `;`. However, the bit about 3 and 4 rooms with a `union` instead of a filter, and `hasBalcony true` instead of `filter( hasBalcony = true )` could speed things up, as they'd constrain the matches more, and there'd be less to filter later. Similarly, combining filter expressions might be better, but I don't know whether or not that already happens automatically. It's something to try, anyhow. – Joshua Taylor Aug 29 '13 at 18:11
  • I just tried out to execute the query like this: – Daniel Gerber Aug 29 '13 at 18:20
  • I just tried out your suggestions, but the query execution / iteration time did not alter. :/ thanx anyway (sorry cant delete comment any more) – Daniel Gerber Aug 29 '13 at 18:28
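Putting the suggestions from the comments together, a condensed version of the query might look like this (a sketch of the suggested rewrites, not a tested query; since `?hasBalcony` is now fixed to `true` in the pattern, it is dropped from the SELECT list):

```sparql
SELECT ?apartment ?price ?lat ?long ?label ?hasImage ?park ?supermarket ?rooms ?area ?street
WHERE
  { ?apartment dssd:hasBalcony true ;
               wgs84:lat ?lat ;
               wgs84:long ?long ;
               rdfs:label ?label ;
               dssd:hasImage ?hasImage ;
               dssd:hasNearby [ dssd:hasNearbyPark ?park ] ;
               dssd:hasNearby [ dssd:hasNearbySupermarket ?supermarket ] ;
               dssd:price ?price ;
               dssd:rooms ?rooms ;
               dssd:area ?area ;
               vcard:hasAddress [ vcard:streetAddress ?street ]
    FILTER ( ?price >= 650.0e0 && ?price <= 1000.0e0 )
    FILTER ( ?rooms >= 3.0e0 && ?rooms <= 4.0e0 )
    FILTER ( ?area >= 60.0e0 && ?area <= 100.0e0 )
  }
```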

1 Answer


On the ARQ Query Engine

First off, you seem to be misunderstanding how the ARQ engine works:

ResultSet resultSet = queryExecution.execSelect();

All the above does is prepare a query plan for how the engine will evaluate the query; it does not actually evaluate the query, which is why it returns almost instantaneously.

The actual work of answering your query does not happen until you start calling hasNext() and next():

while ( resultSet.hasNext() ) {
   QuerySolution solution = resultSet.next(); ...

So the timings you quote are misleading: the query takes 4s to evaluate, because that is how long it takes to iterate over all the results.
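You can confirm this split by timing each phase separately. A minimal sketch, using ARQ's ResultSetFormatter.consume() (which simply drains the iterator and returns the row count), applied to the queryExecution from your snippet:

```java
// Assumes the queryExecution from your code above; ResultSetFormatter
// lives in the same package as QueryExecution in your Jena version.
long t0 = System.currentTimeMillis();
ResultSet resultSet = queryExecution.execSelect();  // only builds the plan
long t1 = System.currentTimeMillis();
int rows = ResultSetFormatter.consume(resultSet);   // forces full evaluation
long t2 = System.currentTimeMillis();
System.out.println("execSelect: " + (t1 - t0) + " ms, iterating "
        + rows + " rows: " + (t2 - t1) + " ms");
```

You should see nearly all of the 4s attributed to the consume() call, not to execSelect().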

On your actual question

You haven't shown what your buildQuery() method does, but you say you are building the query as an Op structure programmatically rather than as a string. If so, the query engine may not actually be applying optimization, though off the top of my head I don't think this will be the issue. You can try adding an op = Algebra.optimize(op); before you return the built Op, but I don't know that this will make much difference.
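For reference, a minimal sketch of that step (buildOp() is a hypothetical stand-in for however you construct the algebra; OpAsQuery can turn the optimized algebra back into a Query object if you need one):

```java
// Sketch: run ARQ's standard optimizer over a programmatically built
// algebra expression before executing it.
Op op = buildOp();                        // hypothetical: your Op construction
op = Algebra.optimize(op);                // apply ARQ's algebra-level rewrites
Query optimized = OpAsQuery.asQuery(op);  // optional: back to a Query object
```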

It looks like the optimizer should do a good job given just the raw query (not that your query has much scope for optimization other than join reordering), but if you are building it programmatically then you may be producing an unusual algebra that the optimizer struggles with.

Similarly, I'm not sure whether your stats.opt file will be honored, because you query over a specific model rather than the TDB dataset, so the query engine used might be the general-purpose engine rather than the TDB engine. I'm not an expert in TDB so I can't tell whether this is the case.
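One experiment worth running (a sketch only; I can't confirm that this makes TDB pick up stats.opt, so treat it as an assumption to test) is to execute against the Dataset itself and select the named graph inside the query with GRAPH, rather than querying a Model view:

```java
// Sketch: execute against the TDB-backed Dataset rather than a Model view,
// scoping the pattern to the named graph with GRAPH so that the TDB
// query engine, rather than the general-purpose one, handles the query.
Query query = QueryFactory.create(
    "SELECT * WHERE { GRAPH <http://example.com> { ?apartment ?p ?o } }");
QueryExecution qe = QueryExecutionFactory.create(query, dataset);
try {
    ResultSet rs = qe.execSelect();
    while (rs.hasNext()) { rs.next(); }
} finally {
    qe.close();
}
```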

Bottom Line

In general there is not enough information in your question to diagnose whether there is an actual issue in your setup or whether your query is just plain expensive. Reporting this as a minimal test case (minimal complete code plus sample data) to the user@jena.apache.org list for further analysis would be useful.

As a general comment on your query, lots of range filters are expensive to evaluate, which is likely where most of the time goes.

RobV
  • Thanks Rob for your answer. On arq query engine, I figured as much. On my actual question, i tried Algebra.optimize(op) with no effect removing stats.opt has also no effect. Removing the bnode values ?apartment dssd:hasNearby ?hasNearbyStop . ?hasNearbyStop dssd:hasNearbyStop ?stop . ?apartment dssd:hasNearby ?hasNearbySupermarket . ?hasNearbySupermarket dssd:hasNearbySupermarket ?supermarket . saves me almost 4 seconds, so this is the most expensive part. But I still can't believe that this query takes 4s on my extremely small dataset (which I can't make public). – Daniel Gerber Aug 30 '13 at 09:07
  • since I'm working with jena on e.g. dbpedia (with more complex queries) for a long time now. – Daniel Gerber Aug 30 '13 at 09:09
  • @DanielGerber Remember that your notion of what is a complex query and ARQ's don't necessarily align! Often with SPARQL the size of the dataset is largely irrelevant to performance, what's important is the size of the intermediate results. It would be interesting if you could at least post your `stats.opt` file since that will give us some idea of the characteristics of your dataset without revealing your data. – RobV Aug 30 '13 at 16:10
  • To clarify, there is typically a correlation between the size of the dataset and the size of intermediate results that will be produced and thus how long it takes to answer a query. **However** it is relatively trivial to write a query that performs terribly regardless of how small the dataset is. And conversely to write a query that is always very fast regardless of how large the dataset is. – RobV Aug 30 '13 at 18:01
  • Hey @RobV, could you please help me clarify: is the model in RAM or is the query executed over a hdd (if I execute it like above)? if hdd, can I somehow load it to RAM completely? [Here](https://www.dropbox.com/s/qn6kqy8t7926v6t/stats.opt) are my stats. Would be more useful to exclude the bnodes from the query and use the model api to get the data? thanks rob, daniel – Daniel Gerber Sep 02 '13 at 12:10
  • Hmmm, so it looks like your intermediate results should be relatively small so the problem is indeed likely in the complexity of all the filters. Have you tried comparing performance between running the query without any `FILTER` clauses and with your `FILTER` clauses which should tell you whether the filtering is the performance sink. – RobV Sep 04 '13 at 17:49
  • As I wrote before, the filters are not the expensive part. Removing the bnodes (supermarket, park) makes this query execute in 100ms. – Daniel Gerber Sep 05 '13 at 10:08