I'm trying to run the "Cats" Wikidata query locally against a 2016 Wikidata dump (.ttl format):

PREFIX bd: <http://www.bigdata.com/rdf#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?item
WHERE
{
  ?item wdt:P31 wd:Q146.
}

To do this, I'm running sparql --data wikidata-20160201-all-BETA.ttl --query cats.rq in the terminal. I have an R5 3600X CPU and 16GB of RAM, and the query just keeps running for minutes on end, using 70% of the CPU and roughly 4GB of RAM. The same query on the Wikidata endpoint - which currently holds several times more data than the 2016 dump - runs in under 2 seconds while also fetching labels via SERVICE, which mine doesn't.

I'm using Apache Jena to run SPARQL queries and I've been testing mostly on Windows 10. The queries return correct results instantly for small files, such as the ones from Learning SPARQL, so Apache Jena seems to be configured and working fine. However, I'm a complete novice with knowledge bases/Wikidata/SPARQL, so maybe I'm messing something up.

Edit: I got this error message after ~20 minutes: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded.

AtilioA
  • You should increase the `Xmx` param of Java - you're loading the whole Wikidata dump into memory. I also don't think doing it that way is efficient: each query will re-parse the Turtle dump and load it into memory. Loading it into a triple store - in the Jena case that would be TDB(2) - is the way to go. This will also improve query performance, as an index will be created and the result of your query can be returned almost instantly. The query is trivial; it will use the `pos` index. – UninformedUser Sep 12 '20 at 06:29
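
For reference: Jena's command-line scripts read the JVM_ARGS environment variable, so - assuming that holds for the Jena version in use - raising the heap before retrying would look like this (12g is an illustrative value, not a recommendation):

export JVM_ARGS=-Xmx12g
sparql --data wikidata-20160201-all-BETA.ttl --query cats.rq

(On Windows cmd.exe, the equivalent is set JVM_ARGS=-Xmx12g.)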

1 Answer

sparql --data wikidata-20160201-all-BETA.ttl ... is going to try to load the whole of that file into memory before executing the query. wikidata-20160201-all-BETA.ttl is a large file.

Instead, load the file into a TDB database:

tdb2.tdbloader --loc WD wikidata-20160201-all-BETA.ttl
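
(With a recent Jena, tdb2.tdbloader also takes a --loader option; --loader=parallel can speed up a bulk load of this size, though treat the exact flag as version-dependent:)

tdb2.tdbloader --loader=parallel --loc WD wikidata-20160201-all-BETA.ttl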

then query that:

tdb2.tdbquery --loc WD --query cats.rq

It won't fit in memory, and there is little point loading it all for a simple lookup. When you use SERVICE and ask the Wikidata endpoint, you are querying an already-loaded database.
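
As a quick sanity check after the load completes - assuming tdb2.tdbquery accepts an inline query string, as arq does - you can count the triples in the store:

tdb2.tdbquery --loc WD 'SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'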

There is a remote SPARQL tool in Jena: rsparql

rsparql --service https://query.wikidata.org/bigdata/namespace/wdq/sparql --query cats.rq

This sends the query to the given endpoint and supports the same output options as sparql.
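
For example, assuming rsparql accepts the same --results flag as sparql for choosing an output format:

rsparql --service https://query.wikidata.org/bigdata/namespace/wdq/sparql --query cats.rq --results=CSV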

AndyS
  • This question is also asked on /r/datasets. Please add the answer there. – AndyS Sep 12 '20 at 09:20
  • Thanks. Loading the file into a TDB database errored on Windows after an hour or so (`[main] ERROR org.apache.jena.riot - [line: 434032540, col: 14] Illegal escape sequence value: a (0x61) org.apache.jena.riot.RiotException: [line: 434032540, col: 14] Illegal escape sequence value: a (0x61)`) and I'm trying again on Linux. It's been running for almost two hours now, but it hasn't reached that line yet. Should I expect the WD folder to be larger than the dump itself? – AtilioA Sep 12 '20 at 17:30
  • 1
    The data is illegal RDF. I'm afraid you will need to fix up the file before loading. Rather than keep forcing the loader, use `riot` to parse the file and find errors, fix and repeat. (New wikidata dumps are much better in this regard, as well as being much larger). WikiData is a very large dataset; correcting the data and loading it is a major task in itself, especially on a machine with a rotating disk, not SSD. To just learn SPARQL, I suggest using `rsparql` and send the query to the WikiData site or using a much smaller dataset. – AndyS Sep 12 '20 at 18:29
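
For reference, a minimal form of the riot check suggested above (riot's --validate flag parses the file with strict checking and produces no output, only reporting any errors found):

riot --validate wikidata-20160201-all-BETA.ttl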