
This is an evolution of this question.

Basically I am having trouble getting all the solutions to a SPARQL query from a remote endpoint. I have read through section 2.4 here because it seems to describe a situation almost identical to mine.

The idea is that I want to filter my results from DBpedia based on information in my local RDF graph. Here is the query:

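PREFIX ns1: <http://www.semanticweb.org/caeleanb/ontologies/twittermap#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>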

SELECT *
WHERE {
  ?p ns1:displayName ?name .
  SERVICE <http://dbpedia.org/sparql> {
    ?s rdfs:label ?name .
    ?s rdf:type foaf:Person .
  }
}

And the only result I get is dbpedia:John_McCain (for ?s). I think this is because John McCain is the only match in the first 'x' results, but I can't figure out how to get the query to return all matches. For example, if I add a filter like:

SERVICE <http://dbpedia.org/sparql> {
  ?s rdfs:label ?name .
  ?s rdf:type foaf:Person .
  FILTER(?name = "John McCain"@en || ?name = "Jamie Oliver"@en)
}

Then it correctly returns BOTH dbpedia:Jamie_Oliver and dbpedia:John_McCain. There are dozens of other matches like Jamie Oliver that do not come through unless I explicitly add them to a FILTER like this.

Can someone explain a way to extract the rest of the matches? Thanks.

Evan
  • Assuming that the `SERVICE` clause is computed first, you can't do anything as the public DBpedia endpoint has a default limit of 10000 results that will be returned by a single query. And I'm pretty sure, that this is not considered by the federated query engine of your triple store. By the way, it's always interesting to know which triple store is used. – UninformedUser Oct 21 '17 at 11:20
  • I'm using Stardog. But damn :/ So basically I'm receiving 10,000 results from DBPedia and that's only enough to match John McCain? And when I use the FILTER I reduce the size way below 10,000 so I can see more matches? – Evan Oct 21 '17 at 11:24
  • I guess the SPARQL standard for federated query assumes that there is no technical limit for the returned resultset when specifying the semantics - which indeed makes sense. Right, I guess it's just by chance that John McCain is in the first 10,000 matching results. – UninformedUser Oct 21 '17 at 11:33
  • @AKSW Well I just wrote a script to create a FILTER containing all of the string I want to match (like FILTER(?name="name1" || ?name="name2"...)) but I get an HTTP 500 (Internal Server Error) when I try to execute the query. Is there a limit on how long my FILTER can be? – Evan Oct 21 '17 at 11:35
  • Oh yes, there is a limit on the length of a HTTP GET request for Virtuoso. As far as I remember it was `10000 byte` length of the query string. For longer queries, you would have to use a POST request – UninformedUser Oct 21 '17 at 11:39
  • That was the last piece of the puzzle for me :) With a suuuuuper long FILTER block and using POST I got the info that I wanted. I'm new to Stack Overflow but is there a way for me to recognize your answer? – Evan Oct 21 '17 at 11:42
  • Cool, you should provide your final solution as an answer here. This helps others with similar problems. And don't forget to accept your own answer :D – UninformedUser Oct 21 '17 at 11:53
  • And before I forget: With SPARQL 1.1 it's possible to make the query more compact using `IN` keyword, i.e. `FILTER(?name IN ("John McCain"@en, "Jamie Oliver"@en, ...))` – UninformedUser Oct 21 '17 at 11:57
  • [Stardog ships with a default Service implementation which uses SPARQL Protocol to send the service fragment to the remote endpoint and retrieve the results.](https://www.stardog.com/docs/#_federated_queries) It seems that even if your Stardog contains single person, it will try to retrieve all DBpedia persons. The only thing you can do is to be more selective (e. g. add something like `?s a dbo:USA_Politician`) to avoid huge resultsets or put applicable values into `SERVICE` part manually. – Stanislav Kralin Oct 21 '17 at 13:22
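To illustrate the POST suggestion from the comments, here is a minimal sketch using only Python's standard library. It relies on the SPARQL 1.1 Protocol allowing a query to be sent as a URL-encoded `query` form parameter in a POST body. The function names `build_post_request` and `run_query`, the `Accept` header, and the timeout are assumptions made for illustration and are not taken from the thread.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the question; everything else here is illustrative.
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"


def build_post_request(query, endpoint=DBPEDIA_ENDPOINT):
    """Build (but do not send) a POST request carrying the SPARQL query."""
    # The query travels in the request body, so URL length limits do not apply.
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,  # supplying a body makes urllib issue a POST
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",  # assumed result format
        },
    )


def run_query(query, endpoint=DBPEDIA_ENDPOINT):
    """Send the query and return the raw response text (needs network access)."""
    request = build_post_request(query, endpoint)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")
```

Calling `run_query(long_query)` would then submit a query of arbitrary length; whether the server accepts it still depends on its own limits.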

1 Answer


It looks like the cause of this issue is that the SERVICE block attempts to pull all foaf:Person instances from DBpedia and only then filters them against my local Stardog db. Since the public DBpedia endpoint returns at most 10,000 results per query, only matches that happen to fall within that arbitrary set of 10,000 persons are found. To fix this, I wrote a script that builds a FILTER block containing every name string in my Stardog db and attached it to the SERVICE block, so the filtering happens remotely and the 10,000 result limit is never hit. Because the resulting query is very long, it has to be sent with POST rather than GET. It looks something like this:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ns1: <http://www.semanticweb.org/caeleanb/ontologies/twittermap#>

CONSTRUCT {
  # A comma separates several objects of the same predicate.
  ?s rdf:type ns1:Person , ns1:Politician .
}
WHERE {
  ?s rdfs:label ?name .
  ?s rdf:type dbo:Politician .
  FILTER(?name IN ("John McCain"@en, ...))
}
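The script itself is not shown above. A minimal sketch of how such a FILTER could be assembled, assuming Python and a hypothetical in-memory list of names (in practice they would be read from the local Stardog db), might look like this. The helper names `sparql_literal` and `build_in_filter` are illustrative and not part of any library, and the escaping only covers backslashes and double quotes.

```python
def sparql_literal(text, lang="en"):
    """Return text as a language-tagged SPARQL string literal."""
    # Escape backslashes first, then double quotes, so the literal stays valid.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def build_in_filter(names, var="name", lang="en"):
    """Build a FILTER(?var IN (...)) clause listing every name."""
    literals = ", ".join(sparql_literal(name, lang) for name in names)
    return f"FILTER(?{var} IN ({literals}))"


# Hypothetical sample data standing in for the names stored locally.
names = ["John McCain", "Jamie Oliver"]
print(build_in_filter(names))
# prints: FILTER(?name IN ("John McCain"@en, "Jamie Oliver"@en))
```

The returned string can be pasted into the WHERE clause in place of the hand-written FILTER line.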
Evan
  • Just two minor comments: 1.) use more compact Turtle syntax 2.) it should be enough to use `?s rdf:type dbo:Politician .` because `dbo:Politician` is a subclass of `dbo:Person` which is an equivalent class of `foaf:Person` – UninformedUser Oct 22 '17 at 18:56
  • @Evan, yet another minor comment: knowledgeable people have [said](https://stackoverflow.com/a/39751134/7879193) that `a` can be a performance killer. You probably do not need these `a` patterns if you are using `VALUES` (or something similar). – Stanislav Kralin Oct 23 '17 at 09:38
  • @StanislavKralin, sorry I'm pretty new to SPARQL, I'm not sure exactly how to do that replacement. I believe I understand that a VALUES block allows for multi-dimensional filtering based on a table of allowed values, but wouldn't I still have to specify something like: `WHERE { ?s ?p ?o }` `VALUES (?p ?o) { (rdf:type dbo:Politician) }` Does this increase performance by doing the filtering by rdf:type after the results come back from DBPedia? – Evan Oct 23 '17 at 12:33
  • @Evan, `VALUES` is the more canonical form for providing inline data, but I do not think that `VALUES (?name) {("John McCain"@en) ("John McGain"@en) ...}` will be more performant than `FILTER(?name IN ("John McCain"@en, "John McGain"@en, ...))`. What I mean is that your query will _possibly_ be more performant _without_ `?s rdf:type dbo:Politician`, though some irrelevant results may then appear... – Stanislav Kralin Oct 23 '17 at 12:42
  • @StanislavKralin ah I see. I appreciate the tip but I think I must keep that part of my query intact for my use case. Luckily I don't need to worry too much about performance because I will be running several similar `CONSTRUCT` queries in advance to retrieve triples for a Stardog db which will be treated essentially as static data. – Evan Oct 23 '17 at 12:59
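For comparison with the `FILTER ... IN` form, the `VALUES` variant discussed in the comments could be generated in much the same way. This is only a sketch under the same assumptions (Python, a hypothetical list of names, and an illustrative helper name `build_values_block`), and the comments above suggest it is not expected to be faster.

```python
def build_values_block(names, var="name", lang="en"):
    """Build a VALUES block that supplies the names as inline data."""
    rows = []
    for name in names:
        # Escape backslashes and double quotes so each literal stays valid.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        rows.append(f'("{escaped}"@{lang})')
    return f"VALUES (?{var}) {{ {' '.join(rows)} }}"


# Hypothetical sample data standing in for the names stored locally.
print(build_values_block(["John McCain", "Jamie Oliver"]))
# prints: VALUES (?name) { ("John McCain"@en) ("Jamie Oliver"@en) }
```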