
I am trying to figure out how Jena TDB handles SPARQL queries with multiple FROM clauses on the physical query plan level. I would like to know how Jena TDB handles executing a query over different graphs.

I have run some small experiments and looked at the query algebra, but it is not clear to me how the FROM clauses affect it. It looks as if the FROM clauses are discarded in the algebra. I expect that the algebra is evaluated over the union of the graphs, but I would like to be sure.

I have the following quads:

<http://example.com/book2/> <http://example.com/price> "5"^^<http://www.w3.org/2001/XMLSchema#integer> <http://example.com/A> .
<http://example.com/book2/> <http://example.com/title> "Lord of the Rings" <http://example.com/B> .

and the following query:

SELECT (AVG(?price) as ?total)
FROM <http://example.com/A>
FROM <http://example.com/B>
WHERE {
    ?book <http://example.com/price> ?price .
    ?book <http://example.com/title> ?title .
}

I run the query with:

./tdbquery --loc test --query test.sparql --explain

The query algebra looks as follows:

INFO  exec                 :: ALGEBRA
  (project (?total)
    (extend ((?total ?.0))
      (group () ((?.0 (avg ?price)))
        (bgp (triple ?book <http://example.com/price> ?price)))))

When I execute the query over the data I receive the expected result.

1 Answer


FROM (and FROM NAMED) aren't really part of the query, but indications of what the dataset to be queried ought to be. These clauses don't alter what the query will do, only what it operates on, so you don't see them in the algebra.

What a particular processor does with that information varies:

  • some processors will build the requested dataset (even downloading data)
  • but it is also common to provide a dataset explicitly in APIs (e.g. query(query_string, dataset)), in which case the processor will ignore the FROM and FROM NAMED clauses, since a dataset has already been provided.
  • a dataset might also be supplied within a SPARQL protocol request, in which case, as with the API call, the processor will ignore the FROM and FROM NAMED clauses.

Now a TDB database is a dataset, but TDB has a special feature called 'dynamic datasets', which uses FROM and FROM NAMED to form what is in effect a sub-dataset, limiting the graphs queried to those mentioned in the FROM clauses.
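To make the effect on the example concrete, here is a minimal sketch in plain Python (not Jena code). It models the dataset as a dictionary of triple sets and builds the default graph as the merge of the graphs named in FROM, which is how the SPARQL specification defines the default graph when several FROM clauses are given. The helper names (default_graph, prices_of_titled_books, avg) are made up for illustration, and blank-node handling in the merge is ignored for simplicity.

```python
# Toy model: a dataset maps graph names to sets of (s, p, o) triples.
# This is an illustration only, not how TDB stores or queries data.
PRICE = "http://example.com/price"
TITLE = "http://example.com/title"
BOOK = "http://example.com/book2/"

dataset = {
    "http://example.com/A": {(BOOK, PRICE, 5)},
    "http://example.com/B": {(BOOK, TITLE, "Lord of the Rings")},
}


def default_graph(dataset, from_graphs):
    """Merge the graphs named in FROM into one set of triples."""
    merged = set()
    for name in from_graphs:
        merged |= dataset.get(name, set())
    return merged


def prices_of_titled_books(graph):
    """Join '?book price ?price' with '?book title ?title' on ?book."""
    titled = {s for (s, p, o) in graph if p == TITLE}
    return [o for (s, p, o) in graph if p == PRICE and s in titled]


def avg(values):
    """Average of the values, or None when there is nothing to average."""
    return sum(values) / len(values) if values else None


# The two FROM clauses of the query in the question.
merged = default_graph(
    dataset, ["http://example.com/A", "http://example.com/B"]
)
total_merged = avg(prices_of_titled_books(merged))

# Evaluating the same join inside each graph separately finds no match.
per_graph = [prices_of_titled_books(g) for g in dataset.values()]
```

The join only succeeds on the merged graph, because the price triple lives in graph A and the title triple in graph B; evaluated graph by graph, neither graph contains both patterns for the same book.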

user205512
  • In the case I presented above, the graphs need to be merged at some point in order for the query to be answered. The query will not yield an answer if it is executed over graph A and then over graph B. I would like to know how this works concretely. – Kim Ahlstrøm Meyn Mathiassen Oct 21 '16 at 13:25
    Aha! TDB has a special feature: [dynamic datasets](https://jena.apache.org/documentation/tdb/dynamic_datasets.html), which is why your query works. – user205512 Oct 21 '16 at 14:45
  • I see. Do you happen to know when the dynamic dataset is constructed? Is it done once before the query is answered, or for each "leaf" in the query plan? My example above is too simple to show this problem; if my meaning is not clear, I will post a new question explaining the problem with an example. – Kim Ahlstrøm Meyn Mathiassen Oct 23 '16 at 09:57
  • The reason I ask whether this is pushed down to the leaves of the query plan is that I ran a small experiment in which I split the same 1000 triples into 10, 100, and 1000 graphs. In the experiment, I observed that the query runtime increased exponentially for some queries. If I first executed a CONSTRUCT query to create a single graph with all triples, the query scaled almost linearly. I am just trying to figure out the reason for this behavior. – Kim Ahlstrøm Meyn Mathiassen Oct 23 '16 at 10:01
    A dynamic dataset is essentially a point of indirection within the Jena APIs. It intercepts each database scan requested by the query engine and converts it into multiple database scans, one per graph, merging the results together. This merging also has to take duplicate removal into account, since a graph is a mathematical set. The more graphs you split your data across, the more underlying database scans are required. All scans in TDB use indexes, but going from 10 to 100 graphs results in 10x the number of scans being required, and so forth. – RobV Oct 24 '16 at 09:44
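
The scan fan-out described in the last comment can be sketched roughly as follows. This is a toy model in plain Python, not TDB's actual implementation: CountingStore, DynamicView and scans_for_one_pattern are hypothetical names, and the "index scan" is just a filter over an in-memory set. It only illustrates how one pattern lookup turns into one scan per graph, with duplicates removed while merging.

```python
class CountingStore:
    """Hypothetical quad store that counts how many scans it serves."""

    def __init__(self, quads):
        self.quads = set(quads)  # each quad is (graph, s, p, o)
        self.scans = 0

    def scan(self, graph, s=None, p=None, o=None):
        """Return the triples of one graph that match the pattern."""
        self.scans += 1
        return [
            (qs, qp, qo)
            for (g, qs, qp, qo) in self.quads
            if g == graph
            and (s is None or qs == s)
            and (p is None or qp == p)
            and (o is None or qo == o)
        ]


class DynamicView:
    """Presents several named graphs as one merged default graph."""

    def __init__(self, store, graphs):
        self.store = store
        self.graphs = list(graphs)

    def find(self, s=None, p=None, o=None):
        """One logical scan becomes one store scan per graph."""
        seen = set()
        merged = []
        for g in self.graphs:
            for t in self.store.scan(g, s, p, o):
                if t not in seen:  # a graph is a set: drop duplicates
                    seen.add(t)
                    merged.append(t)
        return merged


def scans_for_one_pattern(n_graphs, n_triples=1000):
    """Split n_triples across n_graphs and evaluate a single pattern."""
    quads = [(f"g{i % n_graphs}", f"s{i}", "p", i) for i in range(n_triples)]
    store = CountingStore(quads)
    view = DynamicView(store, [f"g{i}" for i in range(n_graphs)])
    results = view.find(p="p")
    return store.scans, len(results)


# Maps number of graphs to (store scans issued, triples returned).
scan_counts = {n: scans_for_one_pattern(n) for n in (10, 100, 1000)}
```

In this model the same 1000 triples come back in every configuration, but the number of underlying scans for a single pattern grows with the number of graphs. A query whose plan evaluates a pattern many times (for example, once per binding from an earlier pattern) would pay that per-graph cost on every evaluation.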