I'm trying to retrieve all elements of an rdf:Seq with SPARQL. The RDF structure is as follows: a subproject points to an rdf:Seq of timeclaims, and each timeclaim carries its own information. The list of timeclaims for a subproject can be of any length:

<rdf:Description rdf:about="http://www.example.com/resource/subproject/2017-nieuw-1">
  <rdf:type rdf:resource="http://www.example.com/ontologie/example/Subproject"/>
  <rdfs:label>Subproject label</rdfs:label>
  <pbl:subproject_timeclaims rdf:resource="http://www.example.com/resource/list/5853abbfdcc97"/>
</rdf:Description>

<rdf:Description rdf:about="http://www.example.com/resource/list/5853abbfdcc97">
  <rdf:type rdf:resource="http://www.w3.org/1999/02/22-rdf-syntax-ns#Seq"/>
  <rdf:_1 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd6aa4"/>
  <rdf:_7 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd957b"/>
  <rdf:_6 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd8e68"/>
  <rdf:_14 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfdc541"/>
  <rdf:_5 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd879f"/>
  <rdf:_2 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd71db"/>
  <rdf:_3 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd78be"/>
  <rdf:_4 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd7f92"/>
  <rdf:_8 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfd9c4c"/>
  <rdf:_9 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfda31c"/>
  <rdf:_10 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfdaa08"/>
  <rdf:_11 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfdb0e6"/>
  <rdf:_12 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfdb7bd"/>
  <rdf:_13 rdf:resource="http://www.example.com/resource/timeclaim/5853abbfdbe7f"/>
</rdf:Description>

<rdf:Description rdf:about="http://www.example.com/resource/timeclaim/5853abbfdc541">
  <rdf:type rdf:resource="http://www.example.com/ontologie/example/Timeclaim"/>
  <pbl:timeclaim_description>Description</pbl:timeclaim_description>
  <pbl:timeclaim_hours>25</pbl:timeclaim_hours>
  <pbl:timeclaim_employee 
     rdf:resource="http://www.example.com/resource/employee/2222333334444"/>
</rdf:Description>

Starting from the timeclaims, I'm trying to retrieve the information of the subproject they belong to (and filter on it). The query is very slow: the data is eventually returned, but I have the feeling it could be quicker.

SELECT *
WHERE {
  ?tc_item a :Timeclaim .
  ?tc_list ?p ?tc_item .
  ?subproject pbl:subproject_timeclaims ?tc_list
}

Could you point out any mistakes in the SPARQL query and suggest better ways of doing this? Or could the RDF structure be improved? The numbering is not really relevant in this case, but the same rdf:Seq list structure is used in more places in the database, and the order is important in those cases.

  • What does "slow" mean, and how large is the data? You could try reordering the triple patterns manually to see whether that affects the query optimizer and/or execution. Other than that, the first triple pattern looks more or less redundant. If none of this helps (and we do not know the memory settings you used), you could load the data into a "proper" triple store, i.e. an engine that indexes the data on disk. Whether this is really necessary depends on the dataset size and your machine. In general, RDF4J should be fast enough. – UninformedUser Apr 01 '21 at 12:33
  • Slow means more than 30 seconds. The data to be returned is not that large: approximately a thousand records. – user1492600 Apr 01 '21 at 16:14
  • @user1492600 what kind of store are you using (memory, native, something else)? How is it configured (indexes, inferencing)? And roughly speaking how large is your dataset (total number of triples in the database). – Jeen Broekstra Apr 03 '21 at 01:49
  • Can I assume from your focus on the database configuration and size that the SPARQL query and the setup of the RDF structure are correct? – user1492600 Apr 03 '21 at 20:52
  • @user1492600 both are technically correct, but whether they are the best way to model things is a different matter. For now though let's focus on your question about performance. – Jeen Broekstra Apr 05 '21 at 22:33
  • Okay, if the RDF and SPARQL are not the main bottlenecks then some details of my database and servers. The database is a native store of circa 500M triples. On my development VM I'm using RDF4J 3.4.2 on Tomcat 8. I see in Workbench that max memory is ca. 500M. – user1492600 Apr 06 '21 at 14:14
  • Correction: not 500M triples but 500,000 triples. – user1492600 Apr 06 '21 at 18:08
  • I have been experimenting with higher heap sizes for Tomcat 8. I went up to 3 GB, but to no avail. The query to the REST endpoint still takes about 10 seconds. This is a different query than the one above. – user1492600 Apr 06 '21 at 20:15
  • There is not enough information here to provide an answer, but suffice it to say that on a data set of that size, this kind of query should definitely not take 10 seconds or more, even with "only" 500M heap. It's worth looking at the indexing strategy of your native store, and perhaps using RDF4J's query explain feature to figure out the bottleneck. See https://rdf4j.org/documentation/programming/repository/#explaining-queries – Jeen Broekstra Apr 07 '21 at 14:01
  • I was wondering if there was such a feature in RDF4J. Unfortunately, I'm using PHP to access the REST API of RDF4J, so I can't use this function. I added another index today (opsc), but that didn't make any difference either. So I tried to improve the query itself. By moving some triple patterns around I managed to get the response time below 1 second, which is low enough to make it usable in my application. Thanks for the advice ;-) – user1492600 Apr 07 '21 at 19:01
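
The thread does not show the reordered query that ended up working. As a rough sketch of the kind of rewrite suggested in the comments, the query below starts from the subproject link, drops the type pattern that was called more or less redundant, and restricts ?p to the rdf:_n membership properties so that the rdf:type triple of the list is not matched. The pbl: namespace IRI is an assumption, since the question never declares it, and the ?index binding is only an illustration of how the sequence position could be recovered where order matters.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
# Assumed namespace: the question does not show the IRI behind pbl:
PREFIX pbl: <http://www.example.com/ontologie/pbl/>

SELECT ?subproject ?tc_item ?index
WHERE {
  # Start from the subproject-to-list link, the most selective pattern here
  ?subproject pbl:subproject_timeclaims ?tc_list .
  ?tc_list ?p ?tc_item .
  # Keep only container membership properties (rdf:_1, rdf:_2, ...)
  FILTER(STRSTARTS(STR(?p), "http://www.w3.org/1999/02/22-rdf-syntax-ns#_"))
  # Extract the numeric position so rdf:_10 sorts after rdf:_9
  BIND(xsd:integer(STRAFTER(STR(?p), "http://www.w3.org/1999/02/22-rdf-syntax-ns#_")) AS ?index)
}
ORDER BY ?subproject ?index
```

Whether this ordering of patterns is actually faster depends on the store's indexes and optimizer, so it is worth timing against the original query on the real data.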
