I'm fairly new to SPARQL, and am experiencing some interesting behaviour that I do not understand.

So I have four genes:

a-gene
b-gene
c-gene
d-gene

and two strains:

strain1
strain2

and the following triples:

strain1 hasGene a-gene
strain1 hasGene b-gene
strain1 hasGene d-gene

strain2 hasGene b-gene
strain2 hasGene d-gene

My goal is to write a SPARQL query that adds a value for the property hasBinary to every strain, where hasBinary is a binary string encoding which genes a strain has and does not have, e.g.:

strain1 hasBinary 1101
strain2 hasBinary 0101

strain1 has a-gene, b-gene and d-gene, but not c-gene. Given the query (note that strains and genes are in the classes Strain and Gene, respectively):

select ?s (group_concat(?result ; separator="") as ?binary)
where { ?g a Gene.
        ?s a Strain.
        optional{?s ?hasGene ?g.}.
        bind((if(bound(?hasGene), "1","0")) as ?result). }
group by ?s
order by ?s

The output is:

strain1 1101
strain2 0101 

Which is correct. But when I do the query:

construct {?s hasBinary ?binary}
where{

select ?s (group_concat(?result ; separator="") as ?binary)
where { ?g a Gene.
        ?s a Strain.
        optional{?s ?hasGene ?g.}.
        bind((if(bound(?hasGene), "1","0")) as ?result).


      }
group by ?s
order by ?s

}

The output is:

strain1 hasBinary 0111
strain2 hasBinary 0011

Which is totally wrong. It's as though group_concat is sorting the result. I have no idea why it is doing this, and the binary is useless if its digits are sorted. Any help with this issue would be appreciated.

CubeJockey
sam_peels
  • For future reference, can you make sure you actually post syntactically correct examples: proper RDF and correct SPARQL (_including_ namespace definitions etc)? That way people who want to try and help don't have to fix it up first before they can actually try out your problem. – Jeen Broekstra Jan 14 '16 at 19:36
  • I think you're relying on an ordering, which is not meant to be reliable. Triples are not necessarily stored nor returned in the order they are written, inserted, etc. It's entirely legitimate for you to get `strain2 hasBinary` `0011` or `0101` or `1010` or `1100` or any permutation with two 1s and two 0s. If you want the Genes to be ordered before the `bind()` and thus before the `group_concat()`, you must say so in your query. – TallTed Jan 14 '16 at 20:53

1 Answer

As far as I can tell, your queries are correct, and the problem you observe is a bug in whatever SPARQL engine you're using. Or at least: when I tried your case on a Sesame store (version 2.8.8), it gave me the expected result.

EDIT: The reason I got correct results is that Sesame just happens to return results in the expected order, but as @TallTed correctly remarked, this is not actually enforced by the query, so it's not something you can depend on. So my earlier assertion that this is a bug in the endpoint was wrong.

Let's explore this a bit.

Data I used:

@prefix : <http://example.org/> .

:a-gene a :Gene .
:b-gene a :Gene .
:c-gene a :Gene .
:d-gene a :Gene .

:strain1 a :Strain .
:strain2 a :Strain .

:strain1 :hasGene :a-gene .
:strain1 :hasGene :b-gene .
:strain1 :hasGene :d-gene .


:strain2 :hasGene :b-gene .
:strain2 :hasGene :d-gene .

If we look at the simplest form of the query, we want back every combination of ?s and ?g, optionally with a relation between them (in this data, the :hasGene relation), and we want them in order. Your initial query was basically this:

PREFIX : <http://example.org/>
select ?s ?g
where { ?g a :Gene.
        ?s a :Strain.
        optional { ?s ?hasGene ?g } .
}
order by ?s

Now, this query, in my Sesame store (and your endpoint as well), returns this:

?s                              ?g
<http://example.org/strain1>    <http://example.org/a-gene>
<http://example.org/strain1>    <http://example.org/b-gene>
<http://example.org/strain1>    <http://example.org/c-gene>
<http://example.org/strain1>    <http://example.org/d-gene>
<http://example.org/strain2>    <http://example.org/a-gene>
<http://example.org/strain2>    <http://example.org/b-gene>
<http://example.org/strain2>    <http://example.org/c-gene>
<http://example.org/strain2>    <http://example.org/d-gene>

Looks good, right? All in alphanumeric order. But it's important to realize that the ordering of the ?g column here is a coincidence. If the engine had returned this instead:

?s                              ?g
<http://example.org/strain1>    <http://example.org/c-gene>
<http://example.org/strain1>    <http://example.org/b-gene>
<http://example.org/strain1>    <http://example.org/a-gene>
<http://example.org/strain1>    <http://example.org/d-gene>
<http://example.org/strain2>    <http://example.org/b-gene>
<http://example.org/strain2>    <http://example.org/a-gene>
<http://example.org/strain2>    <http://example.org/c-gene>
<http://example.org/strain2>    <http://example.org/d-gene>

...it would also have been a valid result - after all, nowhere does our query say that ?g should be ordered.

The solution is straightforward, however: order on both ?s and ?g. Since our particular SPARQL endpoint already "coincidentally" returns the correct order even without this, we can verify that it works with a little trick: reverse the order, using the DESC modifier.

Query:

PREFIX : <http://example.org/>
SELECT ?s ?g
WHERE { ?g a :Gene.
        ?s a :Strain.
        OPTIONAL { ?s ?hasGene ?g } .
}
ORDER BY ?s DESC(?g)

Result:

?s                              ?g
<http://example.org/strain1>    <http://example.org/d-gene>
<http://example.org/strain1>    <http://example.org/c-gene>
<http://example.org/strain1>    <http://example.org/b-gene>
<http://example.org/strain1>    <http://example.org/a-gene>
<http://example.org/strain2>    <http://example.org/d-gene>
<http://example.org/strain2>    <http://example.org/c-gene>
<http://example.org/strain2>    <http://example.org/b-gene>
<http://example.org/strain2>    <http://example.org/a-gene>

You can see that the ?g column is now in reverse alphabetical order. This is of course the reverse of what you wanted, but that is easily corrected later by leaving out the DESC part of the query. The point is that we have now verified that it is our query doing the ordering, not whatever endpoint we are using.
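For reference, a sketch of the ascending form of the same query, which is the order the binary string will eventually need. ASC is spelled out here only for symmetry with DESC; it is the default direction.

```sparql
PREFIX : <http://example.org/>
SELECT ?s ?g
WHERE { ?g a :Gene .
        ?s a :Strain .
        OPTIONAL { ?s ?hasGene ?g } .
}
# ASC is the default, so "ORDER BY ?s ?g" is equivalent.
ORDER BY ASC(?s) ASC(?g)
```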

It still won't fully solve the problem of the ordering in your binary string though. Since in your original query the BIND takes place before ordering (because the bind is part of the graph pattern, which gets fully evaluated before result ordering occurs), the ORDER BY clause has no influence on it. That is, if we simply do this query:

PREFIX : <http://example.org/>
SELECT ?s (GROUP_CONCAT(?result ; SEPARATOR="") as ?binary)
WHERE { ?g a :Gene.
        ?s a :Strain.
        OPTIONAL { ?s ?hasGene ?g } .
        BIND((IF(BOUND(?hasGene), "1","0")) AS ?result). 
}
GROUP BY ?s 
ORDER BY ?s DESC(?g)

We still get back this result:

?s  ?binary
<http://example.org/strain1>    "1101"
<http://example.org/strain2>    "0101"

In other words, our binary string is still not reversed, even though the ORDER BY asks for the genes in descending order.
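One way to see why the outer ORDER BY cannot help: once GROUP BY ?s has run, each strain is a single solution, so there are no per-gene rows left to sort. A small sketch against the same sample data (the name ?rows is made up here for illustration):

```sparql
PREFIX : <http://example.org/>
# After grouping, only ?s and the aggregate are available;
# the individual ?g rows have already been folded into each group.
SELECT ?s (COUNT(?g) AS ?rows)
WHERE { ?g a :Gene .
        ?s a :Strain .
        OPTIONAL { ?s ?hasGene ?g } .
}
GROUP BY ?s
ORDER BY ?s
```

On the sample data this should return one row per strain, with ?rows counting the gene rows that were folded into that strain's group. Any ordering of those rows has to happen before the grouping step.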

The solution is to introduce a further subquery that delivers its rows, already sorted, to the outer query, which then concatenates them in that order to create the binary string, like so:

PREFIX : <http://example.org/>
SELECT ?s (GROUP_CONCAT(?result ; SEPARATOR="") as ?binary)
WHERE { 
  { SELECT ?s ?hasGene  
    WHERE { ?g a :Gene.
            ?s a :Strain.
            OPTIONAL {?s ?hasGene ?g.}.
    }
    ORDER BY ?s DESC(?g)
  }
  BIND((IF(BOUND(?hasGene), "1","0")) AS ?result). 
}
GROUP BY ?s

The result of this is:

?s  ?binary
<http://example.org/strain1>    "1011"
<http://example.org/strain2>    "1010"

As you can see, the correct (inverted) binary string is now enforced by the query. We then need to feed this entire beast into the CONSTRUCT query you wanted, and we finally need to take out that inversion of the binary string.

The full query then becomes this:

Query 2:

PREFIX : <http://example.org/>
CONSTRUCT {?s :hasBinary ?binary }
WHERE {
  SELECT ?s (GROUP_CONCAT(?result ; SEPARATOR="") as ?binary)
  WHERE { 
    { SELECT ?s ?hasGene  
      WHERE { ?g a :Gene.
              ?s a :Strain.
              OPTIONAL {?s ?hasGene ?g.}.
      }
      ORDER BY ?s ?g
    }
    BIND((IF(BOUND(?hasGene), "1","0")) AS ?result). 
  }
  GROUP BY ?s
}

Result:

<http://example.org/strain1> <http://example.org/hasBinary> "1101" .
<http://example.org/strain2> <http://example.org/hasBinary> "0101" .
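One detail to be aware of: `?hasGene` in the queries above is a variable, not the `:hasGene` property, so the OPTIONAL matches any predicate that links a strain to a gene. With the sample data that makes no difference, since `:hasGene` is the only such predicate. A minimal sketch of a stricter variant, assuming a SPARQL 1.1 endpoint and the example.org names from the data above, tests the property explicitly with EXISTS:

```sparql
PREFIX : <http://example.org/>
CONSTRUCT { ?s :hasBinary ?binary }
WHERE {
  SELECT ?s (GROUP_CONCAT(?result ; SEPARATOR="") AS ?binary)
  WHERE {
    # Inner query: every strain/gene pair, sorted by strain, then gene.
    { SELECT ?s ?g
      WHERE { ?g a :Gene .
              ?s a :Strain . }
      ORDER BY ?s ?g
    }
    # Only an actual :hasGene triple yields "1"; any other
    # predicate between the strain and the gene is ignored.
    BIND(IF(EXISTS { ?s :hasGene ?g }, "1", "0") AS ?result)
  }
  GROUP BY ?s
}
```

With the sample data this should give the same two triples as Query 2. If the goal is to write the triples into the store rather than just return them, replacing CONSTRUCT with INSERT should work on endpoints that support SPARQL 1.1 Update. Also keep in mind that the SPARQL 1.1 specification leaves the order in which GROUP_CONCAT joins its values unspecified, so the sorted subquery is something engines honour in practice rather than a formal guarantee; it is worth checking the output on your own endpoint.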
Jeen Broekstra
  • The fact that the expected result was obtained from your Sesame store is just an accident. The graph doesn't order the triples; the `:Gene` values may be returned in any order, resulting in any order of digits in the `?binary`. A subquery to sort the `?g` values prior to the `group_concat()` is the only sure way to deliver what the OP wants. – TallTed Jan 15 '16 at 04:19
  • @TallTed I was under the impression aggregates were evaluated _after_ processing grouping and ordering. I haven't looked at the algebra details in a while though - I'll doublecheck. If you're right, I'll update my answer to show the subquery approach. – Jeen Broekstra Jan 15 '16 at 04:43
  • Note that neither the `GROUP BY` nor the `ORDER BY` operates on `?g`, whose ordering is what affects the `?binary` construction, so it doesn't matter when aggregates are evaluated relative to these. – TallTed Jan 15 '16 at 14:56
  • How did I miss that? Odd. Thanks for the heads up, I'll adapt my answer. – Jeen Broekstra Jan 15 '16 at 20:04