I'm trying to query all Wikipedia articles about places in the United Kingdom (they have to be geolocated). I'm using the SPARQL wrapper for Python to access the coordinates, article link, hierarchy and other metadata. My query looks like this:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?id ?label ?link ?lat ?long ?cat_lab ?cat_lab2 ?nchar
WHERE  {  

?uri a dbo:Place .        

?uri rdfs:label ?label . FILTER(lang(?label) = 'en') .

?uri dbo:wikiPageID ?id .

?uri rdf:type ?cat . FILTER (?cat LIKE <http://dbpedia.org/ontology/%>). 
?cat rdfs:subClassOf ?cat2 . FILTER (?cat2 LIKE <http://dbpedia.org/ontology/%> AND
                                     ! ?cat2 LIKE <http://dbpedia.org/ontology/Place> AND
                                     ! ?cat2 LIKE <http://dbpedia.org/ontology/Location>) .
?cat rdfs:label ?cat_lab . FILTER(lang(?cat_lab) = 'en')
?cat2 rdfs:label  ?cat_lab2 . FILTER(lang(?cat_lab2) = 'en')

?uri geo:lat ?lat . 

?uri geo:long ?long . 

?uri dbo:wikiPageLength ?nchar .

?uri prov:wasDerivedFrom ?link .

FILTER(?long >= -1.1 AND ?long <= 1.8 AND ?lat >= 51.1 AND ?lat <= 54.27)

} 

LIMIT 10000
OFFSET 0

I query the data by changing the offset of my query in steps of 10,000 (because of the limit of 10,000 records per query) and then append the results to a single data frame. This works fine, although I get a lot of duplicate records, but that's another issue.
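For reference, the paging loop described above can be sketched roughly as follows. This is a minimal sketch, not the exact code I run: `page_query`, `fetch_all` and `run_on_dbpedia` are hypothetical helper names, the endpoint URL `https://dbpedia.org/sparql` is assumed, and the base query is assumed to be passed in without its own `LIMIT`/`OFFSET`. The `ORDER BY` clause is also an assumption added here, because SPARQL does not guarantee a stable row order across separate requests when only `LIMIT` and `OFFSET` are used.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def page_query(base_query: str, order_by: str, limit: int, offset: int) -> str:
    """Append ORDER BY / LIMIT / OFFSET to a query that has none of them yet."""
    return f"{base_query.rstrip()}\nORDER BY {order_by}\nLIMIT {limit}\nOFFSET {offset}"


def fetch_all(run_query: Callable[[str], List[Row]], base_query: str,
              order_by: str = "?id", page_size: int = 10000) -> List[Row]:
    """Request page after page until one comes back short, then drop duplicate rows."""
    rows: List[Row] = []
    offset = 0
    while True:
        page = run_query(page_query(base_query, order_by, page_size, offset))
        rows.extend(page)
        if len(page) < page_size:  # a short (or empty) page means nothing is left
            break
        offset += page_size

    # Remove exact duplicate rows while keeping the original order.
    seen = set()
    unique: List[Row] = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def run_on_dbpedia(query: str) -> List[Row]:
    """Send one query with SPARQLWrapper and flatten the JSON bindings (needs network)."""
    from SPARQLWrapper import SPARQLWrapper, JSON  # third-party, imported lazily

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")  # assumed endpoint URL
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()["results"]["bindings"]
    return [{name: cell["value"] for name, cell in b.items()} for b in bindings]
```

A call would look like `fetch_all(run_on_dbpedia, base_query, order_by="?id ?cat_lab ?cat_lab2")`. If several rows share the same `?id`, ordering by `?id` alone leaves ties, so listing more variables in `order_by` is probably safer.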

However, when I plot the data on a map, the records appear to be incomplete: there are two very distinct stripes devoid of any records across the whole study area. It is unlikely that this is the normal spatial distribution of the data, so I suspect it has to do with the way the database is queried.

Study area with the two stripes of missing data (each dot is a geo-located wiki article)

If I change the queried spatial bounds to a smaller extent, the stripes persist but appear in a different place, and sometimes there is only one stripe. As I'm quite inexperienced with SPARQL, I'm out of ideas as to how these strange results can occur. Maybe one of you can give me a hint on why the data might look like this.

Cheers!

  • I don't get that part: `?uri rdf:type ?cat . FILTER (?cat LIKE <http://dbpedia.org/ontology/%>). ?cat rdfs:subClassOf ?cat2 . FILTER (?cat2 LIKE <http://dbpedia.org/ontology/%> AND ! ?cat2 LIKE <http://dbpedia.org/ontology/Place> AND ! ?cat2 LIKE <http://dbpedia.org/ontology/Location>) .` - you already get all places with `?uri a dbo:Place .`, so the first pattern wouldn't increase the number of results but lower it – UninformedUser Mar 16 '22 at 15:14
  • using `limit/offset` without `order by` might lead to missing results as there is no guarantee in the order during multiple queries without sorting – UninformedUser Mar 16 '22 at 15:20
  • you should also consider GeoSPARQL extension: `PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> SELECT DISTINCT ?id ?label ?link ?lat ?long ?nchar ?point WHERE { BIND(bif:st_geomfromtext( "BOX(-1.1 51.1, 1.8 54.27)") as ?box) ?uri a dbo:Place . ?uri rdfs:label ?label . FILTER(lang(?label) = 'en') . ?uri dbo:wikiPageID ?id . ?uri geo:lat ?lat . ?uri geo:long ?long . ?uri dbo:wikiPageLength ?nchar . ?uri prov:wasDerivedFrom ?link . BIND((?long, ?lat) as ?point) FILTER ( (?box, ?point ) ) } LIMIT 1000 OFFSET 0` – UninformedUser Mar 16 '22 at 15:34
  • you can either use your existing bound box which I reused in the query, or define a more precise bounding box aka polygon for UK – UninformedUser Mar 16 '22 at 15:35
  • Hi, thanks for the quick answer! This part of the query is to get some information on the ontology of the articles queried to determine the type of place. But I am considering ditching that part anyway... ordering the data does not really do anything, I get the exact same number of records returned – uomo_di_pietro Mar 16 '22 at 16:18
  • I was aware of the GeoSPARQL extension but I did not find any info anywhere on how to install it for the SPARQL Python wrapper but the bounding box filter is precise enough for my purposes as I will spatially filter the data in R later on – uomo_di_pietro Mar 16 '22 at 16:22
  • you don't have to "install" anything for GeoSPARQL. You're using SPARQLWrapper which does nothing more than sending the query to the DBpedia endpoint which is backed by Virtuoso triple store which has those GeoSPARQL extensions enabled. – UninformedUser Mar 17 '22 at 05:44
  • Regarding the missing "stripe" on the map - let's do it the other way around, can we figure some places that are missing based on that stripe? I mean, we can check for some places if those are contained in the DBpedia dataset and then see if the query really does miss those places – UninformedUser Mar 17 '22 at 05:45
  • For a noob like me it was hard to find out if geosparql already works without any prerequisites. But thanks for clearing that up! – uomo_di_pietro Mar 17 '22 at 15:20
  • What I ended up doing was using smaller bounds to extract the missing records and combining them manually. With smaller bounds I had way less trouble with missing data. I guess the database was overwhelmed by the unusually large number of records returned. But anyway, thank you for your swift help, much appreciated! – uomo_di_pietro Mar 17 '22 at 15:25
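
The smaller-bounds workaround from the last comment could be automated along these lines. This is only a sketch under stated assumptions: `tile_filters` and `tile_queries` are hypothetical helpers, the `%BOUNDS%` placeholder in the query template is an invented convention, and the half-open intervals (so that a point on a shared tile edge is returned by exactly one tile) are a design choice made here, not something taken from the thread. The filter uses the standard SPARQL `&&` operator.

```python
from typing import List, Tuple

# (min_long, min_lat, max_long, max_lat)
Box = Tuple[float, float, float, float]


def tile_filters(box: Box, n_cols: int, n_rows: int) -> List[str]:
    """Split a bounding box into n_cols x n_rows tiles and build one FILTER per tile."""
    min_long, min_lat, max_long, max_lat = box
    d_long = (max_long - min_long) / n_cols
    d_lat = (max_lat - min_lat) / n_rows
    filters = []
    for c in range(n_cols):
        last_col = c == n_cols - 1
        west = min_long + c * d_long
        east = max_long if last_col else min_long + (c + 1) * d_long
        long_op = "<=" if last_col else "<"  # only the outermost edge is inclusive
        for r in range(n_rows):
            last_row = r == n_rows - 1
            south = min_lat + r * d_lat
            north = max_lat if last_row else min_lat + (r + 1) * d_lat
            lat_op = "<=" if last_row else "<"
            filters.append(
                f"FILTER(?long >= {west:.4f} && ?long {long_op} {east:.4f} && "
                f"?lat >= {south:.4f} && ?lat {lat_op} {north:.4f})"
            )
    return filters


def tile_queries(template: str, box: Box, n_cols: int, n_rows: int) -> List[str]:
    """Fill the %BOUNDS% placeholder of a query template once per tile."""
    return [template.replace("%BOUNDS%", f) for f in tile_filters(box, n_cols, n_rows)]
```

For the study area above, `tile_queries(template, (-1.1, 51.1, 1.8, 54.27), 4, 4)` would give 16 smaller queries whose results can be fetched one by one, concatenated and deduplicated.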

0 Answers