Optimal Filter Placement in SPARQL Queries

Question

One of the optimizations performed by JenaARQ is to: "Place filters close to where their dependency variables are defined".

This causes the following Query Plan:

  (filter (exprlist (|| (|| (isIRI ?Y) (isBlank ?Y)) (!= (datatype ?Y) <http://example.com/onto/rdf#structure>)) (|| (|| (isIRI ?Z) (isBlank ?Z)) (!= (datatype ?Z) <http://example.com/onto/rdf#structure>)))
    (bgp
      (triple ?X <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Student>)
      (triple ?Y <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Faculty>)
      (triple ?Z <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Course>)
      (triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#advisor> ?Y)
      (triple ?Y <http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf> ?Z)
      (triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#takesCourse> ?Z)
    )))

To be transformed into the following:

  (sequence
    (filter (|| (|| (isIRI ?Z) (isBlank ?Z)) (!= (datatype ?Z) <http://example.com/onto/rdf#structure>))
      (sequence
        (filter (|| (|| (isIRI ?Y) (isBlank ?Y)) (!= (datatype ?Y) <http://example.com/onto/rdf#structure>))
          (bgp
            (triple ?X <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Student>)
            (triple ?Y <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Faculty>)
          ))
        (bgp (triple ?Z <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Course>))))
    (bgp
      (triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#advisor> ?Y)
      (triple ?Y <http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf> ?Z)
      (triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#takesCourse> ?Z)
    )))

It turns out that while the original query plan runs in milliseconds, the "optimized" query plan takes about 7 hours to complete.

Does JenaARQ consider any statistics for optimizing the filter placement in the query plan?

I'm using Jena 3.12.0.

What storage layer is this running over? (I would guess TDB.) It looks like the issue is that the BGP is being broken up in a less than ideal way. There are two competing optimizations: placing filters and reordering basic graph patterns. You can explore this by reordering the basic graph pattern: put the 3 rdf:type triple patterns at the end. There are statistics, but the nature of the filter (is it highly selective, or just a check for a few odd cases?) makes it hard to choose whether a filter is better than a reorder. — AndyS, Jan 16 '20 at 14:02
Yes, I am using TDB1 and TDB2 for comparison. Both showed similar behavior. I was able to reduce query response time by orders of magnitude by setting the following option in the dataset assembler: ```ja:context [ ja:cxtName "arq:optFilterPlacement" ; ja:cxtValue "false" ] ;``` — Elton Soares, Jun 20 '20 at 04:48
As I'm using queries from a public benchmark, I'd like to be able to optimize performance without changing the original queries. — Elton Soares, Jun 20 '20 at 04:52
It is a pragmatic choice whether to push filters in or to rely on the fact that the triple patterns will eliminate possibilities. Turning an optimization off when it does not work is one option. — AndyS, Jun 20 '20 at 13:55
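
The reordering suggested in the comments (moving the three rdf:type triple patterns to the end of the basic graph pattern) can be sketched against the original plan from the question as shown here. This is only the reordered input, written in the same algebra notation with the filter expressions unchanged. How ARQ then places the filters, and whether the storage layer reorders the patterns again, is not shown and would need to be checked on the actual dataset.

```
  (filter (exprlist (|| (|| (isIRI ?Y) (isBlank ?Y)) (!= (datatype ?Y) <http://example.com/onto/rdf#structure>)) (|| (|| (isIRI ?Z) (isBlank ?Z)) (!= (datatype ?Z) <http://example.com/onto/rdf#structure>)))
    (bgp
      (triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#advisor> ?Y)
      (triple ?Y <http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf> ?Z)
      (triple ?X <http://swat.cse.lehigh.edu/onto/univ-bench.owl#takesCourse> ?Z)
      (triple ?X <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Student>)
      (triple ?Y <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Faculty>)
      (triple ?Z <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://swat.cse.lehigh.edu/onto/univ-bench.owl#Course>)
    ))
```

With this ordering, ?Y and ?Z are first bound by the advisor and teacherOf patterns rather than by the rdf:type patterns. If filter placement still splits the BGP, the split would presumably fall at a different point than in the slow plan. This requires editing the query text, so it serves as a diagnostic rather than a fix when the benchmark queries must stay unchanged.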
