
While exploring some SPARQL queries, I noticed that fetching distinct predicates is extremely slow, whereas fetching distinct subjects or objects shows no such problem.

I tested this with LinkedGeoData, running the following queries against LinkedGeoData's own endpoint (without the SERVICE clause in that case, for obvious reasons), the SPARQL Playground, and an Apache Jena Fuseki server. The behavior was the same everywhere. Can anyone help me understand the reason behind it?

# Selecting distinct subjects. Executes fast.
SELECT * WHERE {
  SERVICE <http://linkedgeodata.org/sparql> {
    SELECT DISTINCT ?s
    WHERE {
      ?s ?p ?o .
    } LIMIT 100
  }
}

# Selecting distinct predicates. VERY SLOW.
SELECT * WHERE {
  SERVICE <http://linkedgeodata.org/sparql> {
    SELECT DISTINCT ?p
    WHERE {
      ?s ?p ?o .
    } LIMIT 100
  }
}
RDangol
  • Usually, a dataset has a much smaller schema compared to the size of the instance data, i.e., there are a few properties and classes but many triples that use those classes and properties. Your query has to iterate over the triples in the dataset until enough predicates have been found (i.e., until the LIMIT is reached). This can mean scanning the whole dataset if there are fewer than 100 predicates. LinkedGeoData has a small number of properties and is a very big dataset; thus, your second query will be much slower. – UninformedUser Aug 05 '17 at 04:46
  • @AKSW thanks, that makes sense – RDangol Aug 05 '17 at 04:59
  • @RDangol, then make predicates subjects: `SELECT DISTINCT ?p {?p a rdf:Property} LIMIT 100`. Fortunately, LinkedGeoData contains schema assertions. – Stanislav Kralin Aug 05 '17 at 05:22
  • @RDangol, note, though, that many of those declared properties are not actually used. Compare `SELECT DISTINCT ?p {?p a rdf:Property . FILTER EXISTS { ?s ?p ?o }}` (75 results) and `SELECT DISTINCT ?p {?p a rdf:Property . FILTER NOT EXISTS { ?s ?p ?o }}` (150 results). – Stanislav Kralin Aug 05 '17 at 06:06
  • @StanislavKralin OMG thank you. It is fast! I do not understand the reason behind it though. Gotta figure out what schema assertions means. – RDangol Aug 05 '17 at 06:35
  • A schema assertion is an explicit triple in the dataset that states the type of an entity, e.g., that `:locatedIn` is a property or `:Place` is a class. Querying for these is much more efficient than iterating over all triples in the dataset, because indexes like `p o s` can be used (that's just a technical aspect of improving query performance). – UninformedUser Aug 05 '17 at 07:21
  • @AKSW thanks again – RDangol Aug 05 '17 at 07:45
  • Be careful with schema assertions, though: many datasets out there lack them, and most endpoints do not perform any kind of reasoning, so you're likely to end up with far fewer `?p` than via `SELECT DISTINCT ?p { ?s ?p ?o }`. – Jörn Hees Oct 08 '17 at 15:20

1 Answer


Answered in comments by @AKSW; rephrased a bit here --

Usually, the schema of a dataset comprises far fewer triples than the instance data; i.e., there are relatively few properties and classes, but many more triples that use each of those classes and properties.

Your query has to iterate over the triples in the dataset until enough predicates have been found (i.e., until the LIMIT is reached). This can even result in scanning the whole dataset if there are fewer predicates than your LIMIT (fewer than 100, here).

LinkedGeoData has a fairly small number of properties (~1,805; see query text and live result [takes approximately 3 minutes]) and a fairly large number of triples (~1,384,887,592; see query text and live result [takes approximately 1 minute]); thus, your second query will be much slower.
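To make the effect concrete, here is a minimal Python sketch of the scan-until-LIMIT behavior described above. It is only a toy model, not how any particular triple store actually executes the query: the in-memory list of tuples, the function name `scan_until_limit`, and the dataset sizes are all assumptions made for illustration.

```python
def scan_until_limit(triples, position, limit):
    """Scan triples in order, collecting distinct values at `position`
    (0 = subject, 1 = predicate, 2 = object) until `limit` distinct
    values have been seen. Returns (distinct_values, triples_scanned)."""
    seen = set()
    scanned = 0
    for triple in triples:
        scanned += 1
        seen.add(triple[position])
        if len(seen) >= limit:
            break  # LIMIT reached, stop early
    return seen, scanned


# Hypothetical toy dataset: 10,000 subjects, each using the same 5 predicates.
triples = [
    (f"s{i}", f"p{j}", f"o{i}_{j}")
    for i in range(10_000)
    for j in range(5)
]

subjects, scanned_for_subjects = scan_until_limit(triples, 0, 100)
predicates, scanned_for_predicates = scan_until_limit(triples, 1, 100)

print(len(subjects), scanned_for_subjects)      # stops early
print(len(predicates), scanned_for_predicates)  # scans every triple
```

In this toy model the subject scan stops after a few hundred triples, while the predicate scan never finds 100 distinct values and therefore reads all 50,000 triples.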

A predicate index would certainly speed up this query; it's just not a default index in Virtuoso databases, because it wouldn't provide much benefit in most common scenarios (which this query is not). We discuss our default "3+2" indexing scheme, and how to add some additional sometimes-valuable indexes, in the documentation.
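As a rough illustration of why an index that leads with the predicate helps, consider the following Python sketch. It is not Virtuoso's implementation or index layout; the dictionary-based index and the helper names `build_predicate_index` and `distinct_predicates` are hypothetical, chosen only to show the idea.

```python
from collections import defaultdict


def build_predicate_index(triples):
    """Build a predicate-leading index: predicate -> list of (subject, object).
    The cost is paid once, when the data is loaded."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[p].append((s, o))
    return index


def distinct_predicates(index, limit):
    """The index keys are already the distinct predicates,
    so no triple has to be visited at query time."""
    return list(index)[:limit]


# Hypothetical toy dataset: 1,000 subjects, each using the same 3 predicates.
toy_triples = [
    (f"s{i}", f"p{j}", f"o{i}_{j}")
    for i in range(1_000)
    for j in range(3)
]

predicate_index = build_predicate_index(toy_triples)
found = distinct_predicates(predicate_index, 100)
print(found)
```

With such an index, answering the query costs work proportional to the number of distinct predicates rather than to the number of triples, which is the speed-up referred to above.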

TallTed
  • while this answer accurately describes the state of the art, is there an actual reason why such a query can't be served quickly from a predicate index? – Jörn Hees Oct 08 '17 at 15:21
  • @JörnHees - No; see my edited answer which points to relevant docs. – TallTed Dec 27 '17 at 14:34