
I am trying to extract all items of a category on Wikidata, with their respective page title in English. It works ok as long as the category does not contain many items, like this:

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q734454.
  ?work rdfs:label ?workLabel .
  FILTER ( LANGMATCHES ( LANG ( ?workLabel ), "en" ) ) 
}
ORDER BY ?work

but it times out ("Query timeout limit reached") as soon as I use a category with more items, such as Q2188189. See this example.

I have tried using LIMIT or OFFSET clauses but this does not change the result.

I have also tried inserting a filter like this: FILTER (regex(?work, '.*Q1.*')) to slice the query into subsets, also without success (No matching records found).
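For reference, the attempt looked roughly like this (illustrative; the exact LIMIT/OFFSET values varied):

SELECT ?work ?workLabel
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q2188189.
  ?work rdfs:label ?workLabel .
  FILTER ( LANGMATCHES ( LANG ( ?workLabel ), "en" ) )
  # slicing attempt that returned "No matching records found":
  # FILTER ( regex ( ?work, '.*Q1.*' ) )
}
ORDER BY ?work
LIMIT 10000 OFFSET 0

(One possible reason for the empty result of the regex attempt: ?work binds an IRI, while regex expects a string, so it would need STR(?work).)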

For now I have only extracted the IDs and then run queries to get the page title for each one of them, but that seems silly.

Is there a way to work around the timeout?

simone
  • Recursive use of property P279 is always critical. If this were just a theoretical question, I would answer "you can't always go beyond technical limits". Otherwise, I would imagine that you are doing this for a purpose. Therefore, I would ask you: 1. What's the point in obtaining all the music works on Wikidata (there are half a million items of that category!)? Can't you filter the query (reducing its output size) according to your purpose? 2. Are you aware that `rdfs:label` does *not* return the page title but just the item's label? – logi-kal Feb 08 '21 at 19:02
  • @horcrux - 1. what's the goal: I'm trying to highlight/filter/reprocess links on a Wikipedia page that point to a page that is about a musical work. I've tried checking link by link, or all links at once, using the Wikidata API, and it's too slow (there are up to 800 links per page), so I'm trying to scrape them into a database - then query times are way faster 2. yes - my mistake. I'll look for the version of the query that looks up the page title - though that one is just as slow. Will just using P31 make things faster? – simone Feb 08 '21 at 19:58

1 Answer


Standard method

If you want the page title of all the music works which have an article on en.wikipedia.org, you must use the following query:

SELECT ?work ?workTitle
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q2188189.
  ?workLink schema:about ?work ;
    schema:isPartOf <https://en.wikipedia.org/> ;
    schema:name ?workTitle .
}

I tried it three times, and two out of the three it did not exceed the timeout.

Alternative method

If you don't manage to make it work, the only workaround I can imagine is to retrieve all the possible types (i.e. subclasses) of music work, and adapt the above query to the single-class case.

So, the first step is:

SELECT ?workType WHERE { ?workType wdt:P279* wd:Q2188189. }

You'll get more than a thousand results. For each of them (take for example the result Q2743), you'll then have to run the following query:

SELECT ?work ?workTitle
WHERE
{
  ?work wdt:P31 wd:Q2743.
  ?workLink schema:about ?work ;
    schema:isPartOf <https://en.wikipedia.org/> ;
    schema:name ?workTitle .
}

This will return all the items that are directly instances of Q2743, without caring about subclasses.

This method is a bit cumbersome, but you can use it if you don't mind running many queries. The idea is to divide the complexity among many queries, so that each one is less likely to exceed the timeout.
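To cut down the number of round trips, the per-class queries can also be batched a few classes at a time with a VALUES block. A minimal sketch (Q2743 comes from the example above; the second ID is just a placeholder to show the batching, substitute whatever the first query actually returns):

SELECT ?work ?workTitle
WHERE
{
  # a handful of subclasses taken from the results of the first query
  VALUES ?workType { wd:Q2743 wd:Q7366 }
  ?work wdt:P31 ?workType.
  ?workLink schema:about ?work ;
    schema:isPartOf <https://en.wikipedia.org/> ;
    schema:name ?workTitle .
}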

logi-kal
  • The first option works, and I can make the second work too, as I most likely won't need ALL the subcategories. Thanks a lot. Out of curiosity, I'd love to understand why the offset/limit trick doesn't work? – simone Feb 08 '21 at 21:02
  • @simone I guess it doesn't because the `ORDER BY` clause forces the engine to retrieve **all** the results in order to sort them, and only then prune them according to the `LIMIT` clause. Try removing the `ORDER BY` and you will find that the "limit trick" works (a sketch of that variant is below these comments). – logi-kal Feb 08 '21 at 21:12
  • thanks, I should have thought of that. In SQL I would have first extracted the slice and then joined. I'll try to figure out if something like that is possible in SPARQL; I'm only 3 days into it – simone Feb 08 '21 at 21:15
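Following up on the comment above, a minimal sketch of the paging variant without ORDER BY (the page size is arbitrary):

SELECT ?work ?workTitle
WHERE
{
  ?work wdt:P31/wdt:P279* wd:Q2188189.
  ?workLink schema:about ?work ;
    schema:isPartOf <https://en.wikipedia.org/> ;
    schema:name ?workTitle .
}
LIMIT 10000
OFFSET 0

Note that without ORDER BY the result order is not guaranteed to be stable between requests, so paging with OFFSET can in principle skip or duplicate rows.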