
I am pretty new to SPARQL and Apache Jena, so please forgive my naivety. I loaded a Wikidata dump (705G) using the TDB2 loader and executed some query examples from the Wikidata Query Service. Most of the queries take longer in Jena than in the Wikidata Query Service. My machine has 750G of RAM and 80 CPUs. My questions are:

  1. Why is the Wikidata Query Service faster than Jena?
  2. How can I improve query performance without rewriting the queries? Are there indexing techniques or specific server configurations that would help?

I looked through all the Stack Overflow questions tagged [Jena] and didn't find anything about this. If you can point me to tutorials or material other than the official Jena website, that would be great.

logi-kal
  • WDQS runs on a cluster of machines (see https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Hardware ). There are some custom extensions as well, e.g. the label service. Queries will have been written to suit the WDQS system. – AndyS Oct 14 '21 at 19:29
  • Out of interest - which Wikidata dump did you use? Do you have a record of the load and how long the TDB2 (parallel? phased/default?) loader took for each step? – AndyS Oct 14 '21 at 19:33
  • latest-truthy.nt.gz from https://dumps.wikimedia.org/wikidatawiki/entities/. I tried several approaches. Only loading in separate steps with tdbloader2data (17h 25min) and tdbloader2index (10h 7min) succeeded. I don't have timings for the other attempts, but they failed with OOM even though I set -Xms600g -Xmx700g. – Aleksei Keks Oct 15 '21 at 07:38
  • Thx. Confusingly, that's a TDB1 loader (legacy naming!). Setting -Xmx700g probably slowed it down - a lot of the work in the first step happens outside the heap. It'll probably be renamed and ported to TDB2, with some speed-ups in the data stage, and should also work better on spinning disks. – AndyS Oct 15 '21 at 08:17

1 Answer


You can try the next-generation TDB2 (instead of TDB1):

tdb2.tdbloader --loc /path/to/tdb2/ /path/to/some.ttl

Also, building a TDB2 database like that does not generate statistics by default; you have to do it manually. First cd to the TDB2 directory you created (following the example above, that is /path/to/tdb2) and run (in bash):

tdb2.tdbstats --loc=`pwd` > /tmp/stats.opt
mv /tmp/stats.opt /path/to/tdb2/Data-0001/

The statistics "guide the optimizer in choosing one execution plan over another", which could help you achieve better query performance. See https://jena.apache.org/documentation/tdb/optimizer.html#running-tdbstats
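The two statistics commands above can be wrapped in a small helper so that a failed or empty run never leaves a partial stats.opt inside the database. This is only a sketch: the function name install_tdb2_stats and the TDBSTATS_CMD override variable are made up for illustration, and it assumes the database has the usual Data-0001 subdirectory and that tdb2.tdbstats is on your PATH.

```shell
#!/usr/bin/env bash
# Sketch: generate TDB2 statistics and install them into the database.
# install_tdb2_stats and TDBSTATS_CMD are hypothetical names for this example;
# TDBSTATS_CMD defaults to the real tdb2.tdbstats command.

install_tdb2_stats() {
  local db_dir="$1"
  local stats_cmd="${TDBSTATS_CMD:-tdb2.tdbstats}"
  local data_dir="$db_dir/Data-0001"
  local tmp

  # The stats file belongs in Data-0001, so that directory must already exist.
  if [ ! -d "$data_dir" ]; then
    echo "no Data-0001 directory under $db_dir" >&2
    return 1
  fi

  tmp="$(mktemp)" || return 1

  # Write to a temporary file outside the database first.
  if ! "$stats_cmd" --loc="$db_dir" > "$tmp"; then
    echo "statistics command failed" >&2
    rm -f "$tmp"
    return 1
  fi

  # Refuse to install an empty statistics file.
  if [ ! -s "$tmp" ]; then
    echo "statistics output is empty" >&2
    rm -f "$tmp"
    return 1
  fi

  mv "$tmp" "$data_dir/stats.opt"
}
```

Usage would be something like `install_tdb2_stats /path/to/tdb2`. Stop any server using the database first, and restart it afterwards so the new stats.opt is picked up.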

justin2004
  • Update: The TDB2 xloader has been used to load Wikidata (the full dataset as well as truthy). https://jena.apache.org/documentation/tdb/tdb-xloader.html – AndyS Dec 29 '21 at 20:46