-1

I need to extract information about articles (e.g., abstract, thumbnail) that are located in the nested subcategories of a given category (e.g., History). How can I do that with a SPARQL query? Or what is the best way to do that in Python with a few SPARQL subqueries?

sermal
  • For example, I'm interested in History category. This category has subcategories, each subcategory has subsubcategories and so on. And I want to retrieve all articles from different level of subcategories for History. PREFIX dct: SELECT ?x WHERE { ?x dct:subject } – sermal Aug 03 '17 at 19:12
  • Please edit your question and put the query (Markdown formatted there) - not in a comment – UninformedUser Aug 03 '17 at 19:13
  • 1
    Sub-categories can be retrieved by using the `skos:broader` relation or its inverse, `skos:narrower`. Note that you should limit the depth of traversal, as it might be too expensive if the category hierarchy is too big – UninformedUser Aug 03 '17 at 19:15
  • 1
    In addition to using appropriate predicates in your **SPARQL query**, it's helpful to use the right class names in your **SO question**. I think you are looking for things that are `skos:narrower*` than ``, which is a ``, not a "category" – Mark Miller Aug 03 '17 at 19:42

1 Answer

6

This gets all ?sc "subcategories" that are recursively (or transitively) narrower than "History", up to a depth of 3. I implemented that with the {minDepth,maxDepth} property-path notation, which Virtuoso understands but which is not part of standard SPARQL 1.1, so other triplestores may not accept it. I have also added English-language filtering on string literals, while still retaining triples with IRIs for ?o.

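SELECT ?sc ?lab ?p ?o 
WHERE {
  ?sc skos:broader{1,3} <http://dbpedia.org/resource/Category:History> .
  # Language filter sits inside OPTIONAL so that unlabelled ?sc rows are kept
  optional {?sc rdfs:label ?lab . filter (lang(?lab) = "en") } .
  ?sc ?p ?o 
  # Keep English literals and all IRIs
  filter ((lang(?o) = "en") || isURI(?o))
} 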
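
For a triplestore that only accepts standard SPARQL 1.1 property paths, the same bounded traversal can be written as an alternation of fixed-length sequence paths. The following is a minimal Python sketch of that rewrite, not something from the original answer; `expand_bounded_path` is a hypothetical helper name, and it only covers minimum depths of 1 or more.

```python
def expand_bounded_path(predicate: str, min_depth: int, max_depth: int) -> str:
    """Rewrite predicate{min,max} as a standard SPARQL 1.1 property path.

    Hypothetical helper: it emits one sequence path per allowed depth and
    joins them with the alternative operator '|'.
    """
    if min_depth < 1 or max_depth < min_depth:
        raise ValueError("expected 1 <= min_depth <= max_depth")
    # One sequence path per depth, e.g. p/p/p for depth 3.
    alternatives = [
        "/".join([predicate] * depth)
        for depth in range(min_depth, max_depth + 1)
    ]
    # Parenthesize so the whole alternation is used as a single path.
    return "(" + "|".join(alternatives) + ")"


path = expand_bounded_path("skos:broader", 1, 3)
print(path)
# (skos:broader|skos:broader/skos:broader|skos:broader/skos:broader/skos:broader)
```

The resulting string can replace `skos:broader{1,3}` in the triple pattern above. The expansion grows with the maximum depth, so it is only practical for small bounds.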

Additionally, that query reports all of the triples with ?sc as the subject. I didn't see any abstracts (using <http://dbpedia.org/ontology/abstract> as predicate?) or any thumbnail relationships. You can confirm that by projecting only distinct ?p, or even counting:

SELECT ?p (count(?p) as ?pcount)
WHERE {
  ?sc skos:broader{1,3} <http://dbpedia.org/resource/Category:History> .
  # Language filter sits inside OPTIONAL so that unlabelled ?sc rows are kept
  optional {?sc rdfs:label ?lab . filter (lang(?lab) = "en") } .
  ?sc ?p ?o 
  filter ((lang(?o) = "en") || isURI(?o))
} 
group by ?p
order by desc(?pcount)
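
If the rows from the first query have already been fetched in Python, the same tally can be done client-side. This is a small sketch that assumes the results are in the standard SPARQL 1.1 JSON results layout (`results` → `bindings`); `count_predicates` is a hypothetical helper, and the sample rows are made up for illustration.

```python
from collections import Counter


def count_predicates(results: dict, var: str = "p") -> list:
    """Count how often each value of `var` occurs in SPARQL JSON results."""
    bindings = results["results"]["bindings"]
    # Rows where the variable is unbound are skipped.
    counts = Counter(row[var]["value"] for row in bindings if var in row)
    # most_common() sorts by descending count, like ORDER BY DESC(?pcount).
    return counts.most_common()


# Made-up sample rows in the SPARQL JSON results layout.
sample = {
    "head": {"vars": ["sc", "p", "o"]},
    "results": {
        "bindings": [
            {"p": {"type": "uri", "value": "http://www.w3.org/2000/01/rdf-schema#label"}},
            {"p": {"type": "uri", "value": "http://www.w3.org/2004/02/skos/core#broader"}},
            {"p": {"type": "uri", "value": "http://www.w3.org/2000/01/rdf-schema#label"}},
        ]
    },
}

for predicate, n in count_predicates(sample):
    print(n, predicate)
```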

If you do deeper recursion, you will find some abstracts. But the deep recursion is slow and I feel like I'm conceptually missing something.

SELECT *
WHERE {
  ?sc skos:broader{5,7} <http://dbpedia.org/resource/Category:History> .
  ?sc <http://dbpedia.org/ontology/abstract> ?a 
} 
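
One possible explanation, offered here as an assumption rather than as part of the original answer: on DBpedia, abstracts and thumbnails are attached to article resources, and articles point to their categories through `dct:subject` (the predicate used in the asker's comment). Under that reading, the query would walk the category tree with `skos:broader` and then join to the articles, instead of looking for abstracts on the categories themselves. The Python sketch below builds such a query using only standard SPARQL 1.1 paths. The function names, the endpoint URL, and the use of `dbo:thumbnail` are assumptions, and articles filed directly under the root category are not included.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public DBpedia endpoint


def build_article_query(category_iri, max_depth=3, lang="en", limit=100):
    """Build a query for articles in subcategories up to max_depth levels down."""
    if max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    if any(ch in category_iri for ch in '<>" \n'):
        raise ValueError("category_iri must be a plain IRI")
    # Standard-SPARQL replacement for skos:broader{1,max_depth}.
    path = "|".join(
        "/".join(["skos:broader"] * depth) for depth in range(1, max_depth + 1)
    )
    return f"""PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT DISTINCT ?article ?abstract ?thumbnail
WHERE {{
  ?cat ({path}) <{category_iri}> .
  ?article dct:subject ?cat ;
           dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "{lang}")
  OPTIONAL {{ ?article dbo:thumbnail ?thumbnail }}
}}
LIMIT {int(limit)}"""


def run_query(query, endpoint=ENDPOINT):
    """Send the query over HTTP GET and return the JSON result bindings."""
    url = endpoint + "?" + urlencode({"query": query})
    request = Request(url, headers={"Accept": "application/sparql-results+json"})
    with urlopen(request, timeout=60) as response:
        return json.load(response)["results"]["bindings"]


query = build_article_query("http://dbpedia.org/resource/Category:History", max_depth=2)
print(query)
```

Calling `run_query(query)` would send the request to the endpoint. Keeping `max_depth` and `limit` small is advisable, since, as noted above, deep traversal of the category hierarchy is slow.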
Mark Miller
  • 1
    Nice answer, indeed. One comment: you should mention that using `property{n, m}` in a property path is non-standard SPARQL syntax and just an extension of Virtuoso. It was discussed in the [submission phase](https://www.w3.org/TR/sparql11-property-paths/) but unfortunately never made it into the official [W3C recommendation](https://www.w3.org/TR/sparql11-query/#propertypaths). Cheers – UninformedUser Aug 04 '17 at 01:28
  • Thanks! The first query is very close to the goal. How can I add a filter for "en"? The column "p" contains values with "@en", but the last column contains rows in different languages at this time. – sermal Aug 04 '17 at 20:31
  • Thanks for the feedback. What do you mean the "p" column contains values with "en"? p is the predicate, not a language-typed literal. I have updated the answer to show filtering on ?lab and ?o. note that this will hide ?sc ?p ?o relationships in which ?o is an IRI, as opposed to a literal. For example, that might hide URLs for thumbnail images. – Mark Miller Aug 04 '17 at 22:12
  • @AKSW Is there a W3C standard for transitivity limits? I could swear I've used it with triplestores other than Virtuoso, but maybe my memory is bad. This notation is certainly better than `OPTION ( TRANSITIVE, t_distinct, t_in(?s), t_out(?o), t_min (1), t_max (4)...)` – Mark Miller Aug 04 '17 at 22:23
  • Indeed, this notation would be better and more compact than the one which is Virtuoso-specific, but there isn't anything in the standard as far as I know. I know that some other tools also allow this extension, but it depends on the triple store. For example, Jena also has an ARQ extension for this feature, see https://jena.apache.org/documentation/query/property_paths.html – UninformedUser Aug 05 '17 at 09:19