How do I improve indexing of large SPARQL datasets?

Question

Here is a very simple SPARQL query that takes an extremely long time (10 seconds) to run in MarkLogic (8.0-6.4). What can I do to speed it up?

The data is based on a subset of geonames, and is of the same order of magnitude (about 22 million triples, it looks like).

PREFIX  gj:   <http://mycompany.com/geonames-jurisdiction/1.0/schema#>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  gn:   <http://www.geonames.org/ontology#>

SELECT  *
FROM <http://mycompany.com/geonames-jurisdiction/1.0/data>
FROM <http://mycompany.com/geonames-jurisdiction/1.0/rule-data>
WHERE
  { ?this_0  rdf:type  gj:LocalCounty ;
             gn:name   ?name_1 .
  }
ORDER BY ASC(?name_1)
LIMIT   100

Update

Per MarkLogic's suggestion, I ran a query which inserted a new property into the DB specific to local county:

INSERT {
  GRAPH <http://mycompany.com/geonames-jurisdiction/1.0/rule-data> {
    ?this gj:localCountyName ?name .
  }
}
WHERE {
    ?this a gj:LocalCounty .
    ?this gn:name ?name .
}

I have also made some suggested query revisions:

PREFIX  gj:   <http://mycompany.com/geonames-jurisdiction/1.0/schema#>
PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  gn:   <http://www.geonames.org/ontology#>

SELECT ?this_0 ?name_1
FROM <http://mycompany.com/geonames-jurisdiction/1.0/data>
FROM <http://mycompany.com/geonames-jurisdiction/1.0/rule-data>
WHERE
  { ?this_0  rdf:type  gj:LocalCounty ;
             gj:localCountyName   ?name_1 .
  }
ORDER BY ?name_1
LIMIT   20

This reduces the total query time to about 4 seconds, which is better but still far too slow for such a simple query.

Trace information from the above query:

2017-05-04 12:00:18.684 Info: <triple-value-statistics count="147540458" unique-subjects="25064012" unique-predicates="81" unique-objects="67600843" xmlns="cts:triple-value-statistics">
2017-05-04 12:00:18.684 Info:   <triple-value-entries>
2017-05-04 12:00:18.684 Info:     <triple-value-entry count="8385355">
2017-05-04 12:00:18.684 Info:       <triple-value>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</triple-value>
2017-05-04 12:00:18.684 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-05-04 12:00:18.684 Info:       <predicate-statistics count="8356279" unique-subjects="8341989" unique-objects="13"/>
2017-05-04 12:00:18.684 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-05-04 12:00:18.684 Info:     </triple-value-entry>
2017-05-04 12:00:18.684 Info:     <triple-value-entry count="29204">
2017-05-04 12:00:18.684 Info:       <triple-value>http://mycompany.com/geonames-jurisdiction/1.0/schema#LocalCounty</triple-value>
2017-05-04 12:00:18.684 Info:       <subject-statistics count="2" unique-predicates="2" unique-objects="2"/>
2017-05-04 12:00:18.684 Info:       <predicate-statistics count="0" unique-subjects="0" unique-objects="0"/>
2017-05-04 12:00:18.684 Info:       <object-statistics count="29202" unique-subjects="29202" unique-predicates="3"/>
2017-05-04 12:00:18.684 Info:     </triple-value-entry>
2017-05-04 12:00:18.684 Info:     <triple-value-entry count="29201">
2017-05-04 12:00:18.684 Info:       <triple-value>http://mycompany.com/geonames-jurisdiction/1.0/schema#localCountyName</triple-value>
2017-05-04 12:00:18.684 Info:       <subject-statistics count="0" unique-predicates="0" unique-objects="0"/>
2017-05-04 12:00:18.684 Info:       <predicate-statistics count="29201" unique-subjects="29201" unique-objects="26692"/>
2017-05-04 12:00:18.684 Info:       <object-statistics count="0" unique-subjects="0" unique-predicates="0"/>
2017-05-04 12:00:18.684 Info:     </triple-value-entry>
2017-05-04 12:00:18.684 Info:   </triple-value-entries>
2017-05-04 12:00:18.684 Info: </triple-value-statistics>
2017-05-04 12:00:18.684 Info: [Event:id=SPARQL AST] sessionKey=7777437449602930525
2017-05-04 12:00:18.684 Info:   initialPlan=SPARQLModule[
2017-05-04 12:00:18.684 Info:   Prolog[]
2017-05-04 12:00:18.684 Info:   SPARQLSelect[SPARQLLimit[
2017-05-04 12:00:18.684 Info:       LIMIT GraphNode[Literal "20"^^<http://www.w3.org/2001/XMLSchema#integer>]
2017-05-04 12:00:18.684 Info:       SPARQLProject[order(1)
2017-05-04 12:00:18.684 Info:         GraphNode[Var this_0 0]
2017-05-04 12:00:18.684 Info:         GraphNode[Var name_1 1]
2017-05-04 12:00:18.684 Info:         SPARQLOrder[order(1) UNSORTED
2017-05-04 12:00:18.684 Info:           OrderSpec[
2017-05-04 12:00:18.684 Info:             Variable[QName[(Unknown) name_1] 1]
2017-05-04 12:00:18.684 Info:             ASCENDING EMPTY MIN]
2017-05-04 12:00:18.684 Info:           SPARQLMergeJoin[order(0) hash(0==0) scatter()
2017-05-04 12:00:18.684 Info:             TriplePattern[order(0,1) PSO
2017-05-04 12:00:18.684 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.684 Info:               GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#localCountyName>]
2017-05-04 12:00:18.684 Info:               GraphNode[Var name_1 1]]
2017-05-04 12:00:18.684 Info:             TriplePattern[order(0) OPS
2017-05-04 12:00:18.684 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.684 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-05-04 12:00:18.684 Info:               GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#LocalCounty>]]]]]]]]
2017-05-04 12:00:18.684 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 optimize=1 r=3 t=1.28811 os=360 is=15 mutations=9 seed=15212683942933123635
2017-05-04 12:00:18.684 Info:   initialCost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.726 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=0
2017-05-04 12:00:18.726 Info:   cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.726 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=1
2017-05-04 12:00:18.726 Info:   cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.728 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525 diff=0 diff%=0 r=2
2017-05-04 12:00:18.728 Info:   cost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.728 Info: [Event:id=SPARQL Cost Analysis] sessionKey=7777437449602930525
2017-05-04 12:00:18.728 Info:   bestCost=(m:6.02656e+07,r:0,io:(52.931/1.20414e+07/0),cpu(2):(0/116805/0),mem:87603,c:20,crd:[20,20])
2017-05-04 12:00:18.729 Info: [Event:id=SPARQL AST] sessionKey=7777437449602930525
2017-05-04 12:00:18.729 Info:   plan=SPARQLModule[
2017-05-04 12:00:18.729 Info:   Prolog[]
2017-05-04 12:00:18.729 Info:   SPARQLSelect[SPARQLLimit[
2017-05-04 12:00:18.729 Info:       LIMIT GraphNode[Literal "20"^^<http://www.w3.org/2001/XMLSchema#integer>]
2017-05-04 12:00:18.729 Info:       SPARQLProject[order(1)
2017-05-04 12:00:18.729 Info:         GraphNode[Var this_0 0]
2017-05-04 12:00:18.729 Info:         GraphNode[Var name_1 1]
2017-05-04 12:00:18.729 Info:         SPARQLOrder[order(1) UNSORTED
2017-05-04 12:00:18.729 Info:           OrderSpec[
2017-05-04 12:00:18.729 Info:             Variable[QName[(Unknown) name_1] 1]
2017-05-04 12:00:18.729 Info:             ASCENDING EMPTY MIN]
2017-05-04 12:00:18.729 Info:           SPARQLMergeJoin[order(0) hash(0==0) scatter()
2017-05-04 12:00:18.729 Info:             TriplePattern[order(0,1) PSO
2017-05-04 12:00:18.729 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.729 Info:               GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#localCountyName>]
2017-05-04 12:00:18.729 Info:               GraphNode[Var name_1 1]]
2017-05-04 12:00:18.729 Info:             TriplePattern[order(0) OPS
2017-05-04 12:00:18.729 Info:               GraphNode[Var this_0 0]
2017-05-04 12:00:18.729 Info:               GraphNode[IRI <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>]
2017-05-04 12:00:18.729 Info:               GraphNode[IRI <http://mycompany.com/geonames-jurisdiction/1.0/schema#LocalCounty>]]]]]]]]

How many results in total match the query? And how is the performance without `ORDER BY`? I'm asking because sorting basically needs to run over all the data that matches the graph pattern. — UninformedUser, May 02 '17 at 14:56
If I remove the LIMIT clause and do a count, it counts ~29,000 triples. — RMorrisey, May 02 '17 at 15:21
Ok, and what about removing the `ORDER BY`? This should be much faster. — UninformedUser, May 02 '17 at 20:34
It does respond much faster without the order by... but I don't think that's helpful. What I need is something like "create index on gn:name order by value ascending" — RMorrisey, May 02 '17 at 20:38
Then you should ask the Blazegraph developers, I'm pretty sure they have some support. This is very tool specific. — UninformedUser, May 03 '17 at 09:47
I had the same problem and I replaced SPARQL with cts functions. — Nik, May 03 '17 at 12:55
I'm interested in understanding this query and data better -- there is no reason for it to be a slow-running query if well-optimized. So I'm wondering whether there are small syntax-related bugs that we need to address. Things I'd try: (1) replace '*' with the specific variables to return; (2) replace ORDER BY ASC(x) with ORDER BY x. — grechaw, May 03 '17 at 15:21
@grechaw Revised the query as you suggested. I didn't see any major improvement (maybe 300 ms less) from those changes. I did get some improvement from copying gn:name to gj:localCountyName, though it's still pretty slow. — RMorrisey, May 04 '17 at 15:50
It seems silly that this query is slow -- I don't like the choice of indexes considering the sort you've requested. I know this is just poking at a query to see if you can affect the plan, but my next suggestion would be to remove the rdf:type pattern. My assumption is that the rdf:type pattern is redundant -- and I could be wrong. Next I'd try passing an optimize argument. When calling the builtin sem:sparql, that is the option "optimize=2" in the third argument; over REST there is a parameter optimize=2. This option increases the time spent on finding an optimal query plan. — grechaw, May 05 '17 at 17:33
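
A minimal sketch of the two suggestions in the last comment, assuming the query is sent to MarkLogic's REST endpoint `/v1/graphs/sparql`. The host, the port (8000) and the helper name `build_sparql_request` are illustrative assumptions, not taken from the thread. Dropping the `rdf:type` pattern is only valid if every subject with `gj:localCountyName` is a `gj:LocalCounty`, which the commenter flags as an assumption. The sketch only builds the request. Sending it needs credentials for your server, which are omitted here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Revised query from the question with the rdf:type pattern removed.
# Assumption: gj:localCountyName is only ever attached to gj:LocalCounty subjects.
QUERY = """\
PREFIX gj: <http://mycompany.com/geonames-jurisdiction/1.0/schema#>

SELECT ?this_0 ?name_1
FROM <http://mycompany.com/geonames-jurisdiction/1.0/data>
FROM <http://mycompany.com/geonames-jurisdiction/1.0/rule-data>
WHERE { ?this_0 gj:localCountyName ?name_1 . }
ORDER BY ?name_1
LIMIT 20
"""


def build_sparql_request(base_url, query, optimize=2):
    """Build (but do not send) a POST to the SPARQL REST endpoint.

    base_url is hypothetical; optimize is passed as a URL parameter,
    as described in the comment above.
    """
    url = f"{base_url}/v1/graphs/sparql?{urlencode({'optimize': optimize})}"
    return Request(
        url,
        data=query.encode("utf-8"),
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


request = build_sparql_request("http://localhost:8000", QUERY)
print(request.full_url)  # http://localhost:8000/v1/graphs/sparql?optimize=2
```

Comparing the `SPARQL AST` trace output with and without `optimize=2` would show whether the planner actually picks a different join or index order.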

score 1 · Answer 1 · answered May 03 '17 at 14:39

Depending on your hardware (memory, CPU, disks), you may increase performance by increasing the number of forests.

David Ennis -CleverLlamas.com

score 0 · Answer 2 · answered Sep 06 '17 at 14:33

MarkLogic uses a scale-out architecture, so there isn't any guarantee of scalable performance with a single machine. The best way to scale is to add more nodes, specifically, e-nodes with adequate memory on each.

scotthenninger
