
I'm trying to extract a clean knowledge base from Wikidata. I would like to end up with many triples such as:

(London, capital of, UK) 
(G.W. Bush, occupation, politician)
... ...

If you follow this link, you can download RDF files containing triples. I've downloaded a .nt file, as it seems to fit what I'm after most closely. Here is what the content of the file looks like:

<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "scr\u00EDbhneoir Sasanach"@ga .
<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "niv\u00EEskar\u00EA br\u00EEtan\u00EE"@ku-latn .
<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "Panulih jo palawak dari Inggirih"@min .
... ...
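
As far as I can tell, each line is one triple: a subject URI, a predicate URI, and an object that is either another URI or a quoted literal with an optional @language tag, terminated by a dot. Pulling a line apart seems straightforward (a rough sketch; the \uXXXX sequences are N-Triples escapes, which I leave undecoded here):

import re

line = ('<http://www.wikidata.org/entity/Q42> '
        '<http://schema.org/description> '
        '"Panulih jo palawak dari Inggirih"@min .')

# Subject and predicate URIs never contain spaces, so two splits suffice.
subj, pred, obj = line.split(" ", 2)

# Literal objects look like "text"@lang; URI objects look like <...>.
m = re.match(r'"(.*)"@([A-Za-z-]+) \.$', obj)
if m:
    text, lang = m.groups()
    print(subj, pred, text, lang)
else:
    print(subj, pred, obj.rstrip(" ."))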

Any idea how I could resolve all the URIs? I looked for a file mapping URIs to plain-text labels but couldn't find anything. In the tutorial videos I've been through, they work with IDs such as wdt:P106 or wd:Q42, and I can see Q42 in the small snippet above, but there seem to be many very different URIs. Also, do you know how I could filter out anything that is not related to the English Wikipedia?
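
To make the question concrete, here is the kind of two-pass approach I have in mind. This is an untested sketch: dump.nt is a placeholder filename, I'm assuming English labels appear in the dump as rdfs:label triples tagged @en, and I'm assuming a wdt:-style predicate like .../prop/direct/P106 carries its label on the corresponding .../entity/P106 URI:

import re

LABEL_PRED = "<http://www.w3.org/2000/01/rdf-schema#label>"
LABEL_RE = re.compile(r'"(.*)"@en \.\s*$')

def entity_form(uri):
    # Assumption: wdt: predicates live under /prop/direct/, while their
    # labels are attached to the /entity/ form of the same ID.
    return uri.replace("/prop/direct/", "/entity/")

# Pass 1: build a URI -> English label mapping.
labels = {}
with open("dump.nt", encoding="utf-8") as f:   # placeholder filename
    for line in f:
        if LABEL_PRED in line:
            subj, _, rest = line.split(" ", 2)
            m = LABEL_RE.search(rest)
            if m:
                labels[subj] = m.group(1)

# Pass 2: skip non-English literals, replace URIs with labels where known.
with open("dump.nt", encoding="utf-8") as f:
    for line in f:
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue
        subj, pred, obj = parts
        if '"@' in obj and '"@en' not in obj:
            continue                            # literal in another language
        obj = obj.rstrip(" .\n")
        print("({}, {}, {})".format(
            labels.get(subj, subj),
            labels.get(entity_form(pred), pred),
            labels.get(obj, obj)))

If line munging like this turns out to be too fragile, rdflib can parse N-Triples properly, although at full-dump scale a streaming, line-oriented pass is probably the only realistic option.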

Any pointer to some good tutorial would also be very much welcome.

user3091275
  • Could it be that the fastest solution is to have a local SPARQL instance running and just query that instance? (see the query sketch after these comments) – user3091275 Feb 07 '20 at 14:09
  • If you want labels instead of the URIs, you first have to extract the lines with the labels from your file and create a mapping manually. In a second step you can then replace the URIs with their labels. Clearly, processing the dump this way can take some time. By the way, "why" is the first question I'd ask you ... and I'd also ask how you handle ambiguity: even "London" denotes multiple places in the world. That's why RDF uses URIs, which are unique identifiers for an entity. – UninformedUser Feb 14 '20 at 18:54
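
Edit, following up on the SPARQL suggestion in the comments: if a SPARQL endpoint is the way to go, a query using the wikibase:label service would return labelled pairs directly, so no manual mapping would be needed. A sketch against the public endpoint (a local instance would just use a different URL); wdt:P106 is the occupation property from my example above:

import requests

QUERY = """
SELECT ?personLabel ?occupationLabel WHERE {
  ?person wdt:P106 ?occupation .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",                 # or a local endpoint
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "kb-extraction-sketch/0.1"},  # the public endpoint asks for a descriptive UA
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["personLabel"]["value"], "occupation", row["occupationLabel"]["value"])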

0 Answers