I'm trying to extract a clean knowledge base from Wikidata. I would like to end up with many triples such as:
(London, capital of, UK)
(G.W. Bush, occupation, politician)
... ...
If you follow this link, you can download RDF files containing triples. I've downloaded a .nt
file, since that format seems to fit what I'm after most closely. Here is what the content of the file looks like:
<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "scr\u00EDbhneoir Sasanach"@ga .
<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "niv\u00EEskar\u00EA br\u00EEtan\u00EE"@ku-latn .
<http://www.wikidata.org/entity/Q42> <http://schema.org/description> "Panulih jo palawak dari Inggirih"@min .
... ...
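To show how I'm currently reading the file, here is a minimal sketch of how I'd pull the ids out of one of those lines (a naive regex that only handles the simple `<subject> <predicate> "literal"@lang .` shape shown above; a proper parser would handle the general case):

```python
import re

# One line copied from the .nt file (double backslash keeps the
# escaped unicode as it appears in the raw file).
line = ('<http://www.wikidata.org/entity/Q42> '
        '<http://schema.org/description> '
        '"scr\\u00EDbhneoir Sasanach"@ga .')

# Naive match: <subject URI> <predicate URI> "literal"@lang .
m = re.match(r'<([^>]+)> <([^>]+)> "(.*)"@(\S+) \.', line)
subject_uri, predicate_uri, literal, lang = m.groups()

# The trailing path segment is the Wikidata id (Q42 here)
subject_id = subject_uri.rsplit('/', 1)[-1]
print(subject_id, lang)  # -> Q42 ga
```

So I can get at the Q42, but I still don't know how to turn these URIs into readable labels.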
Any idea how I could resolve all the URIs? I looked for a file mapping URIs to plain-text labels but couldn't find anything. In the tutorial videos I've watched, people work with ids such as wdt:P106 or wd:Q42, and I can see a Q42 in the small snippet above. But there seem to be many very different URIs. Also, do you know how I could filter out anything that is not related to the English Wikipedia?
Any pointer to a good tutorial would also be very welcome.