3

I'm trying to retrieve ALL movie titles, with their aliases. I'm using queries like these (with increasing OFFSET) and at first it seems to work:

SELECT ?itemLabel ?itemAltLabel WHERE {
  ?item wdt:P31 wd:Q11424.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 1000
OFFSET 0

While it retrieves a lot of valid movie titles, some are missing, although I can find them on the Wikidata site. I can't manage to make some changes to the query (I'm new to SPARQL):

  • For debugging, I want to filter by itemLabel, something like . ?itemLabel = 'fight club'. I tried different options but none worked. Can you help me build such a query?
  • I want to exclude movies that have no itemLabel. These currently return their ID as itemLabel, e.g. "Q12345". How do I add something like . ?itemLabel != ""? Or should it be . ?itemLabel NOT LIKE 'Q[0-9]+' somehow?
  • Sorting: I wonder if the missing titles might be due to not adding any ordering. I'm just running queries with LIMIT 1000 and incrementing the OFFSET with 1000 until there are no results. Could the sorting change between queries? If so, should I just add ORDER BY ?refName?

I could be making some stupid syntax mistakes, so please provide full working queries if you can. If there's anything else you think might prevent me from getting ALL the available titles, let me know.

I'm running the queries here: https://query.wikidata.org/

logi-kal
  • 7,107
  • 6
  • 31
  • 43
Andrei O
  • 325
  • 3
  • 9
  • 2
    Don't use labels for debugging if you know the instance. Use `FILTER(?item = wd:Q190050)` - and it works. As you already recognized, without `ORDER BY` there is no guarantee that you'll get all results by using pagination with `LIMIT n OFFSET n` – UninformedUser Feb 10 '17 at 18:32
  • @AKSW thanks for your comment. If I try to `OREDER BY` `?item` or `?itemLabel`, most often it times out. Is there any other field that I could use, so it doesn't time out? In MySQL I would use an indexed field. – Andrei O Feb 10 '17 at 20:13
  • 1
    No, usually triple stores have indexes for subject (s), predicate (p) and object (o), quite often in several permutations. In your case, `pos` would be used. Nevertheless, `ORDER BY` is expensive and there isn't really a workaround. Note, this is a shared resource, thus, perfomance might be limited. By the way, what is your use case? Maybe you could load the data in your own local triple store. – UninformedUser Feb 10 '17 at 20:59
  • The total number of films is 191173, that'S not that much. It might also be a performance issue to use this "magic" label service. I don't know how this is implemented, but it's non-standard SPARQL. – UninformedUser Feb 10 '17 at 21:01
  • I need up-to-date lists of all valid movie titles, bands, books and a few other types, with aliases. I store them in a relational database and use them for sanitizing user profile data, as part of a recommendation engine pipeline. I've been successfully using http://wdq.wmflabs.org/api for the past year or so, but it recently started to have problems. The maintainer told me it has been deprecated in favour of the Wikidata SPARQL API. – Andrei O Feb 10 '17 at 21:17
  • @AKSW is there a way to achieve my goal without the label service? I'm just using it because that's how I saw in examples. – Andrei O Feb 11 '17 at 11:45
  • I don't know if the problem is caused by the label function - might be more due to load on the server by other people. Standard way without this label service would be to retrieve the `rdfs:label`, i.e. `SELECT ?itemLabel WHERE { ?item wdt:P31 wd:Q11424. ?item rdfs:label ?itemLabel filter(langmatches(lang(?itemLabel), 'en')) } LIMIT 1000 OFFSET 40000` – UninformedUser Feb 11 '17 at 14:39
  • By the way, I think the default limit is 10000, i.e. you could try to increase the page size to it - this would result in just 20 queries compared to 200 when using `LIMIT 1000` – UninformedUser Feb 11 '17 at 14:40

1 Answers1

4

I've encountered this problem before and specifically when dealing with movies :) I think what's happening is that certain entities are instances of sub-categories of film. You want to use the following:

?item wdt:P31/wdt:P279* wd:Q11424 .

to navigate the 'instance of' chain until you hit upon 'film' because if something is, say, an animated film like Toy Story (Q171048) -- which is a subclass of film -- then it wouldn't be returned by the original query.

Hope that helps!