2

I am trying to query the frequency of certain attributes in Wikidata, using SPARQL.

For example, to find out what the frequency of different values for gender is, I have the following query:

SELECT ?rid (COUNT(?rid) AS ?count)
WHERE { ?qid wdt:P21 ?rid.
  BIND(wd:Q5 AS ?human)
  ?qid wdt:P31 ?human.
} GROUP BY ?rid

I get the following result:

wd:Q6581097 2752163
wd:Q6581072 562339
wd:Q1052281 223
wd:Q1097630 68
wd:Q2449503 67
wd:Q48270   36
wd:Q44148   8
wd:Q43445   4
t152990852  1
t152990762  1
t152990752  1
t152990635  1
t152775383  1
t152775370  1
t152775368  1
...

I have the following questions regarding this:

  • What do those t152... values refer to?
  • How can I ignore the tuples containing t152...?
    I tried FILTER ( !strstarts(str(?rid), "wd:") ) but it timed out.
  • How can I count the distinct number of answers?
    I tried SELECT (COUNT(DISTINCT ?rid) AS ?count) with the above query, but again it timed out.
Stanislav Kralin
  • 11,070
  • 4
  • 35
  • 58
benroth
  • 2,468
  • 3
  • 24
  • 25
  • Why do you count the `?rid`? Shouldn't you count the number of humans, i.e. the `?qid`? Otherwise, using `DISTINCT` would always return 1 as value – UninformedUser Jun 05 '17 at 18:32
  • well, I want to count the number of possible values for gender (female, male, transgender, ... etc) that's why I count result ids (rid) and not humans – benroth Jun 05 '17 at 19:47
  • But `rid` is the gender type, right? What you want is the number of humans that have each gender, or not? A triple is e.g. `:human1 :gender :male .` Without a DISTINCT it works, sure- but what you really want is the number of `qid` If you use `DISTINCT` for the `rid` on which you grouped by, then the count will always be 1. – UninformedUser Jun 05 '17 at 19:51
  • No, I want the number of possibilities for 'gender' in the above case. E.g. |{male, female, transgender}| = 3, irrespective of how many persons are assigned to those categories. – benroth Jun 05 '17 at 19:52
  • I.e. in the example given in the post, the desired count would be 8 (ignoring values starting with t152...). – benroth Jun 05 '17 at 19:54
  • Ok. I still don't get it. What is `qid` then? Just as a hint, you group by a variable `qid` - if you count the DISTINCT values of `qid` then, it will always be 1. By the way, it works for me without a timeout: `SELECT ?rid (COUNT(DISTINCT ?rid) AS ?count) WHERE { VALUES ?human {wd:Q5} ?qid wdt:P31 ?human. ?qid wdt:P21 ?rid. } GROUP BY ?rid` – UninformedUser Jun 05 '17 at 19:59
  • `qid` (short for query id) corresponds to the id of a particular person, while `human` corresponds to the concept/type Q5 (Person) in Wikidata. – benroth Jun 05 '17 at 20:02
  • You can ignore the second and third line of the WHERE body. (It just makes sure that only the gender of people is considered, not of fictional animals, aliens and other fringe stuff in wikidata). – benroth Jun 05 '17 at 20:06
  • Just to clarify: You simply want to have the number of distinct gender values? Or the frequency of each gender value? – UninformedUser Jun 05 '17 at 20:07
  • Those strange t- values can be filtered out with adding `FILTER(isURI(?rid))` – UninformedUser Jun 05 '17 at 20:07
  • ideally I would like to know how to do both – benroth Jun 05 '17 at 20:08
  • You already have the solutions: `SELECT (COUNT(DISTINCT ?rid) AS ?count)` that one returns 17 for me – UninformedUser Jun 05 '17 at 20:09
  • And the other query does return the frequency per each gender – UninformedUser Jun 05 '17 at 20:09
  • Thanks, that helped a lot! FILTER(isURI(?rid)) is much faster than the version I had. Somehow `SELECT (COUNT(distinct ?rid) AS ?count)` still does not yield the expected result for me though (1 1 1 ... instead of the count of different values) – benroth Jun 05 '17 at 21:04
  • You have to remove the `GROUP BY` then. – UninformedUser Jun 05 '17 at 23:45
  • By the way, that's what I meant with using `DISTINCT` on `rid` will lead to 1 values for each gender. It has to be `qid` in that case. – UninformedUser Jun 05 '17 at 23:46
  • @benroth, probably, you will be interested in http://whgi.wmflabs.org/gender-by-language.html – Stanislav Kralin Jun 07 '17 at 12:20

1 Answers1

4

Values starting with t are "skolemized" unknown values (see, e.g., Q2423351 for a person of unknown sex or gender).

In order to improve performance, I suggest you to divide your query into three parts:

  1. All "normal" genders:

    SELECT ?rid (COUNT(?qid) AS ?count) 
    WHERE {
       ?qid wdt:P31 wd:Q5.
       ?qid wdt:P21 ?rid.
       ?rid wdt:P31 wd:Q48264 
    } GROUP BY ?rid ORDER BY DESC(?count)
    

    Please note that, according Wikidata, wd:Q746411 is a subclass of wd:Q48270, etc.

  2. All "non-normal" genders:

    SELECT ?rid (COUNT(?qid) AS ?count) 
    WHERE {
       ?qid wdt:P31 wd:Q5.
       ?qid wdt:P21 ?rid.
       FILTER (?rid NOT IN
               (
                wd:Q6581097,
                wd:Q6581072,
                wd:Q1052281,
                wd:Q2449503,
                wd:Q48270,
                wd:Q746411,
                wd:Q189125,
                wd:Q1399232,
                wd:Q3277905
               )
              ).
       FILTER (isURI(?rid))
    } GROUP BY ?rid ORDER BY DESC(?count)
    

    I do not use FILTER NOT EXISTS {?rid wdt:P31 wd:Q48264 } due to performance reasons.

  3. All (i.e. 1) "unknown" genders:

    SELECT (COUNT(?qid) AS ?count) 
    WHERE {
       ?qid wdt:P31 wd:Q5.
       ?qid wdt:P21 ?rid.
       FILTER (!isURI(?rid))
    } 
    

In fact, it is not very important in your case — to count distinct wd:Q5 or count them not distinct — but the latter is preferable due to performance reasons.

Stanislav Kralin
  • 11,070
  • 4
  • 35
  • 58