
Note: possible GraphDB bug (see comments)

I have this knowledge base in GraphDB:

PREFIX : <http://my_awesome_cats_collection#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>


:foo a :cat ;
     :name 'Marble' ;
     owl:sameAs wd:Q27745011 .
# and many other cats

I tried this federated query

select * where { 
    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }

    ?cat :name ?name .
    VALUES ?name {'Marble'}

} 

and I got the expected results from Wikidata (i.e., Marble member of Musashi's).

If I switch the order of the patterns like this:

select * where { 

    ?cat :name ?name .
    VALUES ?name {'Marble'}

    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }
} 

I get many false positive results: data for other cats belonging to Musashi's, while I'd like to get just Marble. I guess it is a kind of cross product between the local and remote patterns.

In the official doc of SPARQL 1.1, they say:

Federated Query may use the VALUES clause to constrain the results received from a remote endpoint based on solution bindings from evaluating other parts of the query.

(the excerpt is informative. thanks to @TallTed for pointing this out)

So, when federating, can VALUES only be used as a final filter? What is going on?

EDIT:

  • Queries are performed with GraphDB
  • It seems to be a bug in the GraphDB query optimizer (thanks to Stanislav Kralin)
floatingpurr
  • Not knowing your "false positive results", nor your local SPARQL engine, it's difficult to say "What is going on." First thing I see is what appears to be a typo in both of your queries -- `?ca wdt:P463` should probably be `?cat wdt:P463`. Then, I note that your excerpted section of the SPARQL 1.1 doc ("2.4 Interplay of SERVICE and VALUES (Informative)") is *Informative*, not *Normative*, and it says *MAY* so it is only a guidance of possibility. – TallTed Nov 26 '18 at 17:56
  • I tried those queries with GraphDB. Ops, I'm sorry: I'm going to fix typos. Thanks, this is not the problem, though. I see that the excerpt is just _informative_ but it is the only reference I found when I was trying to figure that strange behavior out. Regarding false positives, I get also data of other cats belonging to the Musashi's while I'd like to get just Marble – floatingpurr Nov 26 '18 at 18:11
  • 1. It appears to be a GraphDB's query optimizer bug. 2. It also seems that the result depends on ruleset selected. 3. In the first query, `?cat :name 'Marble'` is faster. 4. Section 2.4 is irrelevant, it is about how to dispatch remote queries constrained via locally obtained solutions. – Stanislav Kralin Nov 26 '18 at 18:30
  • Hey @StanislavKralin, nice to see you here! Ok, cool! I'm gonna add the GraphDB tag for luring guys from Ontotext. I can definitely use `?cat :name 'Marble'` but I cannot query more cats in such a way (this is the reason why I'm trying to use `VALUES`). Thanks! – floatingpurr Nov 26 '18 at 18:44
  • Just a thought -- you might try `FILTER ( ?name IN ( 'Marble' ) )` instead of `VALUES ?name {'Marble'}`... This will be slower in many situations, but if it doesn't trigger the apparent bug, that lack of speed may be justified. – TallTed Nov 26 '18 at 20:35
  • This example will work only when the ruleset supports `owl:sameAs`. If you use the default ruleset the engine will not infer that `:foo` is equivalent to `wd:Q27745011` – vassil_momtchev Nov 27 '18 at 05:34
  • Hi @vassil_momtchev. AFAIK GraphDB uses a non-rule implementation of `owl:sameAs`. Therefore, if the repo is properly set, GraphDB makes such kinds of inferences independently of the ruleset. – floatingpurr Nov 27 '18 at 10:33
  • Can you confirm this is true? :) – floatingpurr Nov 27 '18 at 16:38

1 Answer


The example you posted demonstrates one of the corner cases of the SPARQL specification: it combines several related topics that are, in my opinion, highly ambiguous. The details below explain the assumptions and design decisions made in the GraphDB engine. Please note that other implementations may read the following parts of the specification differently:

Interplay of SERVICE and VALUES

The SPARQL 1.1 Federated Query specification has a non-normative section describing the possible behavior in this case:

Implementers of SPARQL 1.1 Federated Query may use the VALUES clause to constrain the results received from a remote endpoint based on solution bindings from evaluating other parts of the query.

GraphDB's query optimizer cannot retrieve any statistics from the remote SPARQL endpoint, so it takes the naive approach: it sends the query to the remote SERVICE as written and joins the results locally. The query optimization task is therefore in the hands of the user, who knows the schema of both repositories and can rearrange the query in a procedural way (see below).

Federated queries are sub-queries

Every remote query is treated as a sub-query and sent as is to the external endpoint. Here is the equivalent syntax:

# remote service
SERVICE <https://query.wikidata.org/sparql> {
    SELECT ?cat ?membership {
        ?cat wdt:P463 ?membership
    }
    LIMIT <put any limit>
}

Sub-queries are evaluated first and all variables are propagated bottom-up

According to the SPARQL specification, no variable bindings should be pushed in the sub-query from the outside:

Subqueries are a way to embed SPARQL queries within other queries, normally to achieve results which cannot otherwise be achieved, such as limiting the number of results from some sub-expression within the query.

Due to the bottom-up nature of SPARQL query evaluation, the subqueries are evaluated logically first, and the results are projected up to the outer query.

Note that only variables projected out of the subquery will be visible, or in scope, to the outer query.

With these semantics, it is no longer possible to efficiently execute queries that have a very selective local clause. That's why GraphDB exposes a special configuration parameter that breaks compliance with the SPARQL specification:

./graphdb -Dreuse.vars.in.subselects=true

In this case, the query engine will ignore the SPARQL spec and push the variable bindings from the outer query into the sub-select. The correct version of your query after enabling this parameter is:

PREFIX : <http://my_awesome_cats_collection#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select * where {
    
    ?cat :name ?name .
    VALUES ?name {
        'Marble'
    }
    
    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }
}

How the user should optimize the query execution plan of remote endpoints

VALUES/BIND are procedural, and their position in the query is significant according to the SPARQL specification:

The BIND form allows a value to be assigned to a variable from a basic graph pattern or property path expression. Use of BIND ends the preceding basic graph pattern. The variable introduced by the BIND clause must not have been used in the group graph pattern up to the point of use in BIND.

Another form of the same query, much less efficient in this particular case, is to first execute the remote endpoint query (i.e. download all results from Wikidata) and then join them with the smaller local dataset:

PREFIX : <http://my_awesome_cats_collection#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select * where {
    
    # remote service
    SERVICE <https://query.wikidata.org/sparql> {
        ?cat wdt:P463 ?membership
    }

    ?cat :name ?name .
    VALUES ?name {
        'Marble'
    }
}
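
If the Wikidata IRIs of the cats are known up front, a third arrangement is to put the VALUES block inside the SERVICE clause, so that the constraint travels with the remote query instead of being applied by a local join afterwards. This is only a sketch: it assumes the local `?cat :name ?name` pattern still matches wd:Q27745011 through owl:sameAs, as in the data from the question, and it has not been checked against GraphDB's optimizer.

```sparql
PREFIX : <http://my_awesome_cats_collection#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select * where {

    # remote service: the VALUES block is part of the pattern sent to Wikidata,
    # so only memberships of the listed entities are returned
    SERVICE <https://query.wikidata.org/sparql> {
        VALUES ?cat { wd:Q27745011 }
        ?cat wdt:P463 ?membership
    }

    # local pattern; matching wd:Q27745011 here relies on owl:sameAs (assumption)
    ?cat :name ?name .
}
```

More cats can be queried by listing more IRIs in the VALUES block. The drawback is that the IRIs are hard-coded rather than derived from the local names.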

I hope this gives you the full picture of how GraphDB interprets the SPARQL specification and of the available options for optimizing federated queries.

vassil_momtchev
  • then two questions appears... 1. Should federated queries be considered to be subqueries which are evaluated logically first? It seems that section 2.4 does not consider bottom-up semantics violation to be harmful in that case. 2. Why join is not performed? In the first query, it is not performed even with (local) `?cat_local owl:sameAs ?cat_remote`. – Stanislav Kralin Nov 26 '18 at 20:55
  • I have slightly restructured my answer to better address your comment – vassil_momtchev Nov 27 '18 at 05:33
  • The correct command line option is `./graphdb -Dreuse.vars.in.subselects=true`. There is an issue under consideration (GDB-3043) to push values into federated queries. – Vladimir Alexiev Mar 10 '19 at 12:39