
I'm loading all geographic entries (Q56061) from the Wikidata JSON dump. The whole dump contains about 16M entries according to the Wikidata:Statistics page.

Using Python 3.4 + ijson + libyajl2, it takes about 93 hours of CPU time (AMD Phenom II X4 945, 3 GHz) just to parse the file. Fetching the 2.3M entries of interest one at a time through online item queries would take about 134 hours in total.
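
For context, a simplified sketch of the kind of streaming parse I mean (not my exact code; `WANTED_CLASSES` and `store()` are placeholders):

```python
import ijson.backends.yajl2 as ijson  # libyajl2-based backend

# Placeholders: the set of class QIDs of interest and a storage hook.
WANTED_CLASSES = {"Q56061"}  # in reality, thousands of subclasses

def store(entity):
    pass  # write the entity to the database

# The dump is one large JSON array of entity objects; ijson streams it item by item.
with open("latest-all.json", "rb") as dump:
    for entity in ijson.items(dump, "item"):
        p31_claims = entity.get("claims", {}).get("P31", [])
        classes = {
            claim["mainsnak"]["datavalue"]["value"]["id"]
            for claim in p31_claims
            if claim.get("mainsnak", {}).get("snaktype") == "value"
        }
        if classes & WANTED_CLASSES:
            store(entity)
```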

Is there a more efficient way to perform this task? (Perhaps something like OpenStreetMap's PBF format and the Osmosis tool.)

QwiglyDee
  • That's weird; in my experiments, parsing the whole dump only takes a few hours on a 2.5 GHz CPU. I have been extracting all the instances of humans like so: `curl https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.json.gz |gzip -d |wikidata-filter --claim P31:Q5 > humans.ndjson`. I can't recall exactly, but it definitely took less than 5 hours – maxlath Jan 12 '17 at 11:16
  • What is this wikidata-filter? – QwiglyDee Jan 12 '17 at 14:54
  • If wikidata-filter can only filter by a single claim, it is not sufficient, because territorial entities usually sit deep in the class hierarchy (one needs wdt:P31/wdt:P279*) – QwiglyDee Jan 12 '17 at 15:04
  • wikidata-filter http://github.com/maxlath/wikidata-filter – maxlath Jan 12 '17 at 18:02
  • what do you use to walk the graph? – maxlath Jan 12 '17 at 18:06
  • I do not walk the graph. I queried all the classes using the query service, and then filter entities that match any of them (over 4000 classes), roughly as in the sketch below. So wikidata-filter is not an option. – QwiglyDee Jan 14 '17 at 11:36
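
A rough sketch of that class-collection step, assuming the public SPARQL endpoint at https://query.wikidata.org/sparql (variable names are illustrative):

```python
import requests

# Fetch every (transitive) subclass of Q56061 from the Wikidata Query Service,
# then use the resulting QID set to filter dump entities by their P31 values.
QUERY = "SELECT ?cls WHERE { ?cls wdt:P279* wd:Q56061 . }"

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
)
resp.raise_for_status()

class_qids = {
    row["cls"]["value"].rsplit("/", 1)[-1]  # strip the entity URI prefix
    for row in resp.json()["results"]["bindings"]
}
```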

1 Answer


My loading code and estimates were wrong.

Using ijson.backends.yajl2_cffi brings full parsing + filtering + storing to the database down to about 15 hours.
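
For reference, a minimal sketch of the backend switch; only the import line differs from the setup described in the question, and the filtering and storing logic is omitted:

```python
import ijson.backends.yajl2_cffi as ijson  # cffi-based yajl2 backend

with open("latest-all.json", "rb") as dump:
    for entity in ijson.items(dump, "item"):
        # same filter-and-store logic as before
        pass
```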

QwiglyDee