1

I'm trying to get a representation of the infobox of articles on Wikipedia in a Python project. I had tried using the Wikipedia API, but the data it outputs is dirty, so I'm trying to move to DBpedia. I need to be able to query by page name, and receive a dictionary of the property names and their values for that page. For example, for the query for London, the returned dictionary would contain:

{dbpedia-owl:PopulatedPlace/areaMetro : 8382.0,
 dbpedia-owl:PopulatedPlace/areaTotal : 1572.0,
 .....
 dbpedia-owl:populationDensity : 5285.0
 .....
}

etc., and from this I would be able to read all the keys that were in the Infobox. I did try using the SPARQL query of

describe <http://dbpedia.org/resource/London>

but that returned tonnes of unnecessary data (the full set of triples associated with London), which is many orders of magnitude more than I need.

How can I write a query to just get the infobox properties, as above?

Joshua Taylor
Sam Heather
  • possible duplicate of [dbpedia extract JSON](http://stackoverflow.com/questions/17755229/dbpedia-extract-json) – Joshua Taylor Jan 21 '15 at 20:41
  • If that duplicate doesn't work for you, do you have a way of identifying which predicates you *do* want? If you can enumerate them, this isn't too hard, but I don't know that there's any way to automatically determine which properties correspond to fields for a certain infobox. – Joshua Taylor Jan 21 '15 at 22:15
  • @JoshuaTaylor the issue is that that returns much more data than necessary - the entire set of triplets if I am correct, with information on them. I'll double check in the morning and update the question if necessary. – Sam Heather Jan 22 '15 at 01:00
  • I guess one approach would be to only take properties with a given prefix, in which case the kind of filtering in [Exclude results from DBpedia SPARQL query based on URI prefix](http://stackoverflow.com/q/19044871/1281433) and [filter out certain properties from sparql query result](http://stackoverflow.com/q/21984461/1281433) may help. – Joshua Taylor Jan 22 '15 at 13:25

2 Answers

2

You might be able to get what you want by selecting properties and objects where the property IRI begins with a namespace you're interested in (e.g., http://dbpedia.org/ontology/). You could use a query like the following. (It takes advantage of the fact that a prefix by itself, e.g., dbpedia-owl:, is still a legal IRI, and you can use str on it. You could also just use the string "http://dbpedia.org/ontology/" directly.)

select ?p ?o where {
  dbpedia:London ?p ?o
  filter strstarts(str(?p),str(dbpedia-owl:))
}

SPARQL results (HTML Table)
SPARQL results (JSON)

The JSON results aren't quite in the format you're looking for, but are like this:

{ "head": { "link": [], "vars": ["p", "o"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://mapoflondon.uvic.ca/" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.british-history.ac.uk/place.aspx?region=1" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.london.gov.uk/" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.museumoflondon.org.uk/" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.tfl.gov.uk/" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.visitlondon.com/" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "https://london.gov.uk/" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink" }   , "o": { "type": "uri", "value": "http://www.britishpathe.com/workspace.php?id=2449&delete_record=75105/" }},
    { "p": { "type": "uri", "value": "http://dbpedia.org/ontology/thumbnail" }  , "o": { "type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Greater_London_collage_2013.png?width=300" }},
...

That sort of makes sense though, because there's not necessarily a unique value for each property, so a Python dict as in the question probably isn't the best result format (but it'd be easy to create one where multiple values are put into a list).
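As a rough illustration of that last point, here is a minimal Python sketch that groups the bindings into a dict of lists. The function name `bindings_to_dict` and the trimmed-down `sample` are mine, not part of DBpedia or any library; the sample just copies a few rows from the JSON above.

```python
from collections import defaultdict


def bindings_to_dict(results):
    """Group SPARQL JSON bindings for ?p and ?o into {property: [values]}."""
    grouped = defaultdict(list)
    for row in results["results"]["bindings"]:
        # Every binding carries the variable's value as a string under "value".
        grouped[row["p"]["value"]].append(row["o"]["value"])
    return dict(grouped)


# A small sample in the same shape as the JSON results shown above.
sample = {
    "head": {"link": [], "vars": ["p", "o"]},
    "results": {"bindings": [
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink"},
         "o": {"type": "uri", "value": "http://www.london.gov.uk/"}},
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/wikiPageExternalLink"},
         "o": {"type": "uri", "value": "http://www.tfl.gov.uk/"}},
        {"p": {"type": "uri", "value": "http://dbpedia.org/ontology/thumbnail"},
         "o": {"type": "uri", "value": "http://commons.wikimedia.org/wiki/Special:FilePath/Greater_London_collage_2013.png?width=300"}},
    ]},
}

info = bindings_to_dict(sample)
for prop, values in info.items():
    print(prop, values)
```

Note that the values stay as strings here. Numeric literals come back as strings too, so if you want floats you would have to convert them yourself based on the datatype given in each binding.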

Also note that the properties that begin with dbpedia-owl: are actually the DBpedia Ontology properties, which have much cleaner data than the raw infobox values, for which properties beginning with dbpprop: are used. You can read more about the different datasets at 4.3. Infobox Data. A query for the raw properties would be pretty much the same though:

select ?p ?o where {
  dbpedia:London ?p ?o
  filter strstarts(str(?p),str(dbpprop:))
}

SPARQL Results (HTML Table)
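Since the question asks about querying by page name from Python, here is a hedged sketch of how either query could be built and sent using only the standard library. The helper names `build_query` and `fetch` are hypothetical. It assumes the public endpoint at http://dbpedia.org/sparql, assumes that endpoint accepts a `format` parameter for JSON output, and assumes the page name only needs its spaces replaced by underscores to form the resource IRI. It uses the plain namespace strings rather than the prefixes, so it doesn't depend on which prefixes the endpoint predefines.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # assumption: public DBpedia endpoint
ONTOLOGY = "http://dbpedia.org/ontology/"  # what dbpedia-owl: expands to
RAW_INFOBOX = "http://dbpedia.org/property/"  # what dbpprop: expands to


def build_query(page_name, namespace=ONTOLOGY):
    """Return a query for the properties of one page within one namespace."""
    # Assumption: replacing spaces is enough to get a valid resource IRI.
    resource = "http://dbpedia.org/resource/" + page_name.replace(" ", "_")
    return (
        "select ?p ?o where {\n"
        f"  <{resource}> ?p ?o\n"
        f'  filter strstarts(str(?p), "{namespace}")\n'
        "}"
    )


def fetch(page_name, namespace=ONTOLOGY):
    """Send the query and return the parsed JSON results (needs network access)."""
    params = urllib.parse.urlencode({
        "query": build_query(page_name, namespace),
        "format": "application/sparql-results+json",  # assumed endpoint parameter
    })
    with urllib.request.urlopen(f"{ENDPOINT}?{params}") as response:
        return json.load(response)


print(build_query("London", RAW_INFOBOX))
```

Calling `fetch("London")` should then give you a structure like the JSON shown earlier, but I haven't pinned down how the endpoint behaves for every page name, so treat this as a starting point.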

Milla Well
Joshua Taylor
  • This certainly goes some way - thanks. But I am just noticing from the actual http://dbpedia.org/page/London, some information is abstracted to triples from the Infobox (linked content). For example, on the wikipedia page (http://en.wikipedia.org/wiki/London), the infobox has a pairing of `{Mayor : Boris Johnson}`, (where Boris Johnson is a string and link) but in DBPedia this is lost - it becomes simply a list of Leaders (ontology/leaderName). If I was specifically looking for 'Mayor of London', how would I go about that from this returned information? – Sam Heather Jan 22 '15 at 18:40
  • @SamHeather DBpedia doesn't preserve Wikipedia information exactly; it provides data *based* on it. The infobox fields get mapped to DBpedia ontology properties. If you don't see it somewhere in [http://dbpedia.org/page/London](http://dbpedia.org/page/London), then you probably can't get it, unfortunately. – Joshua Taylor Jan 22 '15 at 21:13
0

To get the entire data for a page in JSON format, you can also use the method below:

Suppose you want the JSON data for Taj_Mahal and you have the link:

http://dbpedia.org/resource/Taj_Mahal

Change this URL by replacing /resource/ with /data/ and appending the .json extension to the end of the URL, as shown below:

http://dbpedia.org/data/Taj_Mahal.json

This returns all the DBpedia data that matches 'Taj_Mahal' as JSON. To get only the data related to that page, expand the 'http://dbpedia.org/resource/Taj_Mahal' entry in the JSON.
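In Python, the same two steps might look roughly like this. The helper names `data_url` and `properties_of` are hypothetical, and the `sample` document is a made-up stand-in for the download. I'm assuming the JSON is a mapping from subject URI to that subject's properties.

```python
def data_url(resource_url):
    """Turn a /resource/ URL into the matching /data/ URL with a .json suffix."""
    return resource_url.replace("/resource/", "/data/", 1) + ".json"


def properties_of(document, resource_url):
    """Return the entry for one subject from the parsed JSON document."""
    # Assumption: top-level keys are subject URIs; a missing key gives {}.
    return document.get(resource_url, {})


taj = "http://dbpedia.org/resource/Taj_Mahal"
print(data_url(taj))

# Made-up sample standing in for the parsed download; real data is much larger.
sample = {
    "http://dbpedia.org/resource/Taj_Mahal": {
        "http://www.w3.org/2000/01/rdf-schema#label": [
            {"type": "literal", "value": "Taj Mahal", "lang": "en"}
        ]
    },
    "http://dbpedia.org/resource/Agra": {
        "http://dbpedia.org/ontology/wikiPageWikiLink": [
            {"type": "uri", "value": "http://dbpedia.org/resource/Taj_Mahal"}
        ]
    },
}
page_data = properties_of(sample, taj)
print(list(page_data))
```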

Irshad Khan