
I have just successfully created a local standalone Blazegraph instance and loaded Wikidata data into it, following the instructions here: https://github.com/wikimedia/wikidata-query-rdf/blob/master/docs/getting-started.md.

This is the "super" command I used:

git clone --recurse-submodules https://gerrit.wikimedia.org/r/wikidata/query/rdf wikidata-query-rdf \
  && cd wikidata-query-rdf \
  && mvn package \
  && cd dist/target \
  && unzip service-*-dist.zip \
  && cd service-*/

nohup ./runBlazegraph.sh &
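
A quick way to confirm the server actually came up (a sketch: Blazegraph normally listens on port 9999, and since the script runs under nohup its output lands in nohup.out):

# check the server log captured by nohup
tail -n 20 nohup.out
# Blazegraph exposes a status page on its default port
curl -s http://localhost:9999/bigdata/status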

mkdir data \
  && wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-lexemes.ttl.gz \
  && mkdir data/split \
  && ./munge.sh -f latest-lexemes.ttl.gz -d data/split -l en,es -s \
  && ./loadRestAPI.sh -n wdq -d `pwd`/data/split \
  && ./runUpdate.sh -n wdq -l en,es -s

./runUpdate.sh is still running, but it has already pulled in updates up to 2019-09-23T13:31:56Z.
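
As a sanity check, you can ask the local store how much data it holds and how far the updater has advanced. A sketch, assuming the default endpoint at http://localhost:9999/bigdata/namespace/wdq/sparql that the getting-started guide sets up:

# count all triples currently in the local store
curl -s http://localhost:9999/bigdata/namespace/wdq/sparql \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }' \
  -H 'Accept: application/sparql-results+json'

# the updater records a last-modified timestamp; the same query works on query.wikidata.org
curl -s http://localhost:9999/bigdata/namespace/wdq/sparql \
  --data-urlencode 'query=PREFIX schema: <http://schema.org/> SELECT ?date WHERE { <http://www.wikidata.org> schema:dateModified ?date }' \
  -H 'Accept: application/sparql-results+json'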

To test it, I compared my local results with those of the public Wikidata Query Service, and there are differences.

For instance, if I run the "Cats" query from the examples:

#Cats
SELECT ?item ?itemLabel 
WHERE 
{
  ?item wdt:P31 wd:Q146.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

Wikidata Query Service returns 142 results. I get NONE.
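
One way to narrow this down (again a sketch against the assumed default local endpoint) is to count how many P31 = Q146 triples exist locally at all; if the count is zero, the item data was simply never loaded:

curl -s http://localhost:9999/bigdata/namespace/wdq/sparql \
  --data-urlencode 'query=PREFIX wdt: <http://www.wikidata.org/prop/direct/> PREFIX wd: <http://www.wikidata.org/entity/> SELECT (COUNT(?item) AS ?n) WHERE { ?item wdt:P31 wd:Q146 }' \
  -H 'Accept: application/sparql-results+json'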

If I run the "Recent Events" query from the examples:

#Recent Events
SELECT ?event ?eventLabel ?date
WHERE
{
    # find events
    ?event wdt:P31/wdt:P279* wd:Q1190554.
    # with a point in time or start date
    OPTIONAL { ?event wdt:P585 ?date. }
    OPTIONAL { ?event wdt:P580 ?date. }
    # but at least one of those
    FILTER(BOUND(?date) && DATATYPE(?date) = xsd:dateTime).
    # not in the future, and not more than 31 days ago
    BIND(NOW() - ?date AS ?distance).
    FILTER(0 <= ?distance && ?distance < 31).
    # and get a label as well
    OPTIONAL {
        ?event rdfs:label ?eventLabel.
        FILTER(LANG(?eventLabel) = "en").
    }
}
# limit to 10 results so we don't timeout
LIMIT 10

Wikidata Query Service obviously returns 10 results (the query is capped by LIMIT 10). I get ONE.

Why these differences in the results? Is there anything I did wrong?

Thank you in advance.

Additional info about the machine where I'm running Wikidata, just in case it's important:

  • Workstation: Dell Precision 7510
  • OS: Ubuntu 18.04.3 LTS, 64-bit
  • Memory: 32 GB RAM
  • Processor: Intel® Core™ i7-6820HQ CPU @ 2.70 GHz × 8
  • Graphics: Quadro M2000M/PCIe/SSE2
  • Disk: 250 GB SSD
  • You just loaded `latest-lexemes.ttl.gz`? What about the rest of the data? I mean, Wikidata is huge; why didn't you at least load the truthy dataset? Why do you compare querying your tiny dataset with the full dataset loaded in the public endpoint? – UninformedUser Sep 27 '19 at 10:53
  • I'm pretty green at this, @AKSW. I just followed the instructions. To have the full data set, is it enough to load latest-all.ttl.gz only, or do I have to load latest-all.ttl.gz, latest-lexemes.ttl.gz and latest-truthy.ttl.gz separately? Thank you. – FranMercaes Sep 27 '19 at 11:28
  • According to the docs (https://www.wikidata.org/wiki/Wikidata:Database_download#RDF_dumps): *"The complete dumps together contain all entity information in Wikidata with the exception of order (of aliases, of statements, etc.), which is not naturally represented in RDF. Simplified dumps encode statements that have no qualifiers as single RDF triples (references are omitted)."* - you could start with loading the truthy dataset - that's already very large and will take some time ... – UninformedUser Sep 27 '19 at 12:09

1 Answer


In January 2018 I did a successful Wikidata import following the instructions at http://wiki.bitplan.com/index.php/WikiData#Import. My first try, with a standard hard disk, took so long that I estimated a 10-day import time. When I switched to an SSD, the import time went down to 2.9 days. At the time I needed a 512 GB SSD to fit the .jnl file.

Since January 2018 Wikidata has grown further, so you can expect at least a proportional increase in the import time. There has also been some recent discussion about importing on the Wikidata mailing list, where you'll find hints on alternatives and speed issues.

Until the import is finished you won't get sensible results, because linking triples might not be there yet.
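
If the goal is to match the public endpoint, the full dump has to be imported rather than just the lexeme dump. A sketch, untested, that reuses the exact pipeline from the question but with latest-all.ttl.gz (the full dump is hundreds of gigabytes uncompressed, so budget days of import time and enough SSD space for the resulting .jnl file):

# download the full dump instead of just the lexemes
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.ttl.gz
mkdir -p data/split
# preprocess/split, bulk-load into the wdq namespace, then keep the store in sync
./munge.sh -f latest-all.ttl.gz -d data/split -l en,es -s
./loadRestAPI.sh -n wdq -d `pwd`/data/split
./runUpdate.sh -n wdq -l en,es -s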

For the cats example, my January 2018 import has 111 results after 2 seconds. The events example depends on when you run the query, when you did the import, and how many events per month fall in that period. I changed the 31 days to 600, i.e. FILTER(0 <= ?distance && ?distance < 600), to get 10 results after some 30 seconds. If I run the query with no LIMIT and the original 31 days, it gives no result even after 7 hours ...

– Wolfgang Fahl