
I have a series of simple but exhaustive SPARQL queries. Running them against the public SPARQL endpoint of Wikidata results in timeouts. Setting up a local instance of Wikidata would be a serious investment, not worth the time. So I started with a simple solution:

  1. I use the Wikidata SPARQL endpoint to explore the data, tune the query, and evaluate its results, adding LIMIT 100 to avoid timeouts.
  2. Once the query is tuned, I translate it manually into a series of JSON path queries, Python filters, etc., to run over my local dump of Wikidata (a sketch of this step is shown below).
  3. I run them locally. Processing the whole dump sequentially takes time, but it works.
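
For illustration, here is a minimal sketch of steps 2 and 3, assuming the standard Wikidata JSON dump (one entity per line inside one big JSON array, bz2-compressed). The dump path and the P31/Q5 ("instance of" / "human") filter are just example placeholders for whatever my real queries select:

```python
# Sketch of steps 2-3: stream the compressed Wikidata JSON dump line by line
# and keep entity IDs matching a single property/value pair.
# DUMP_PATH, PROP and VALUE are example placeholders.
import bz2
import json

DUMP_PATH = "latest-all.json.bz2"   # assumption: bz2-compressed JSON dump
PROP, VALUE = "P31", "Q5"           # example: instance of (P31) = human (Q5)

def matching_entities(path, prop, value):
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")      # one entity per line, trailing comma
            if not line.startswith("{"):         # skip the surrounding "[" and "]"
                continue
            entity = json.loads(line)
            for claim in entity.get("claims", {}).get(prop, []):
                snak = claim.get("mainsnak", {})
                if snak.get("snaktype") != "value":
                    continue
                datavalue = snak.get("datavalue", {})
                if (datavalue.get("type") == "wikibase-entityid"
                        and datavalue["value"].get("id") == value):
                    yield entity["id"]
                    break

if __name__ == "__main__":
    for qid in matching_entities(DUMP_PATH, PROP, VALUE):
        print(qid)
```

Memory stays flat because the dump is streamed, but a full pass over the compressed dump still takes a long time, as noted in step 3.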

The second step is error-prone and time-consuming. Is there an automatic solution that can execute SPARQL queries (or rather a subset of SPARQL) over a local dump without setting up a database?

My SPARQL queries are pretty simple: they extract entities based on their properties and values. I do not build large graphs, and I do not use any transitive properties.
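
For concreteness, a minimal sketch of step 1: a typical property/value query (here P31 = Q146, i.e. instances of "house cat", purely as an example) run against the public endpoint with LIMIT 100:

```python
# Sketch of step 1: run a simple property/value SPARQL query against the
# public Wikidata Query Service, capped with LIMIT 100 to avoid timeouts.
# The P31/Q146 pattern is only an example of the kind of query I tune here.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item WHERE {
  ?item wdt:P31 wd:Q146 .
}
LIMIT 100
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "dump-filter-prototype/0.1 (example)"},  # WDQS asks for a descriptive UA
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["item"]["value"])   # e.g. http://www.wikidata.org/entity/Q...
```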

Stanislav Kralin
dzieciou
  • How would the second step be faster without a database? Indeed, loading the JSON dump and then running the queries would work, but for each query the whole file would have to be scanned, wouldn't it? The purpose of a database (even MongoDB or something similar for the JSON dump) is to make use of indexing for later querying. Even if your query works on a single line of the JSON dump, which in fact describes a single Wikidata entity, you would still have to walk through the whole file. – UninformedUser Dec 11 '20 at 06:40
  • What you could do is have a look at some existing approaches for handling Wikidata, e.g. https://github.com/usc-isi-i2/kgtk/ (I can't say that much about this tool, though). – UninformedUser Dec 11 '20 at 06:43
  • Indeed, you could also do everything via a bash script and some JSON tool, as described here: https://lucaswerkmeister.de/posts/2017/09/03/wikidata+dgsh/ – UninformedUser Dec 11 '20 at 06:44
  • @UninformedUser I haven't said it would be faster. I mean the public endpoint returns timeouts for exhaustive queries, to protect against DDoS and users like me. Public endpoints are more for exploring Wikidata rather than for getting all the data from them. – dzieciou Dec 11 '20 at 17:02
  • @UninformedUser Thanks for the link. `jq` plus other tools seems the fastest way to go. – dzieciou Dec 11 '20 at 17:05
  • That's true. My point was more that loading it into a database could be more efficient in the end; depending on the complexity of your query it surely will be, given that the approach I linked to was running `jq` on just a single line, i.e. a single entity. But indeed you know your workload better than I do, and yes, loading Wikidata is time-consuming; I did it several times. – UninformedUser Dec 11 '20 at 17:57
  • Prefilter it. Good old `grep` can help a lot to shrink the number of items that need to be loaded. The bottleneck may even be the on-the-fly decompression of the dump, so try unpacking it beforehand. – rwst Jan 07 '21 at 17:54
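
A rough Python analogue of the grep prefilter suggested in the last comment above: do a cheap substring test on each raw line before paying for `json.loads`, so only candidate entities get parsed. This is only a sketch of the idea, not the commenter's exact approach; the `"P31"`/`"Q5"` needles and the dump path are example placeholders.

```python
# Rough Python analogue of the "grep prefilter" idea: skip the expensive JSON
# parse for lines that cannot possibly match, using a plain substring test.
# The '"P31"' / '"Q5"' needles are example placeholders for your own filter.
import bz2
import json

DUMP_PATH = "latest-all.json.bz2"   # assumption: bz2-compressed JSON dump
NEEDLES = ('"P31"', '"Q5"')         # cheap necessary (not sufficient) conditions

def candidate_entities(path, needles):
    with bz2.open(path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            # Substring checks are much cheaper than json.loads on a long line.
            if all(needle in line for needle in needles):
                line = line.strip().rstrip(",")
                if line.startswith("{"):
                    yield json.loads(line)   # precise filtering still needed afterwards

for entity in candidate_entities(DUMP_PATH, NEEDLES):
    # ... apply the exact property/value filter here ...
    print(entity["id"])
```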

0 Answers