I'm trying to retrieve all articles from Wikipedia that are about people. More specifically, I'm looking for:

  • only the page title (and perhaps the page ID)
  • of articles that are about people,
  • separated by gender (for the sake of simplicity, male and female),
  • from the current English Wikipedia.

There are several things I've tried, none of which have worked out:

  • The Wikipedia API lets me search for all pages in a given category. However, searching in "Men" or "Women" fetches mostly subcategory pages, and pages about actual people are buried further down the subcategory hierarchy. I can't find a way to auto-traverse the hierarchy.

  • PetScan lets me specify a hierarchy depth, but requests time out with a depth of more than 3. Also, like the Wikipedia API, results include articles that aren't about people.

  • Wikidata lets me write SPARQL queries to search for entities that have a gender of "male" or "female". This example seems to work, but once I include entity names in the query, it times out. Also, I'm not sure how exactly this data corresponds to Wikipedia articles — is this data guaranteed to be the same as on Wikipedia?

What's the best way to achieve what I'm looking for?

vvye

1 Answer

I've written a SPARQL query that does this. It's important to keep the query as simple as possible (for query optimisation, see https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/query_optimization). Here is the query: https://w.wiki/JhK

For articles about women, this might work with the Wikidata Query Service (WQS), though it is right on the edge of timing out. There are more articles about men, so for those you need to add a LIMIT and step through the results with an increasing OFFSET. The WQS still seems to time out, but there are other endpoints for Wikidata. This one is limited to 100,000 results per query, but works with an increasing OFFSET: https://wikidata.demo.openlinksw.com/sparql

The resulting SPARQL query is something like this:

SELECT ?sitelink
WHERE {
  ?item wdt:P21 wd:Q6581097;
        wdt:P31 wd:Q5.
  ?sitelink schema:about ?item;
            schema:isPartOf <https://en.wikipedia.org/>.
}
LIMIT 100000 OFFSET 100000
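
Stepping through the results with an increasing OFFSET could be scripted roughly as below. This is only a sketch under some assumptions: the function names, the `GENDERS` mapping, the explicit PREFIX lines and the User-Agent string are mine, the endpoint is assumed to accept a standard SPARQL GET request and return JSON results, and Q6581072 is the Wikidata item for "female". Since the query has no ORDER BY, the endpoint is not obliged to return rows in a stable order between requests, so it may be worth de-duplicating the titles afterwards.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the alternative endpoint mentioned above; any SPARQL endpoint
# that serves Wikidata could be substituted here.
ENDPOINT = "https://wikidata.demo.openlinksw.com/sparql"
PAGE_SIZE = 100_000  # the per-query result cap mentioned above

# Wikidata items used as values of P21 (sex or gender).
GENDERS = {"male": "Q6581097", "female": "Q6581072"}

# Declared explicitly in case the endpoint does not predefine them.
PREFIXES = """PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
"""


def build_query(gender_qid, limit, offset):
    """Return the query from the answer for one page of results."""
    return f"""{PREFIXES}SELECT ?sitelink
WHERE {{
  ?item wdt:P21 wd:{gender_qid};
        wdt:P31 wd:Q5.
  ?sitelink schema:about ?item;
            schema:isPartOf <https://en.wikipedia.org/>.
}}
LIMIT {limit} OFFSET {offset}"""


def title_from_sitelink(url):
    """Turn an article URL such as .../wiki/Ada_Lovelace into 'Ada Lovelace'."""
    path = urllib.parse.urlparse(url).path
    _, _, raw_title = path.partition("/wiki/")
    return urllib.parse.unquote(raw_title).replace("_", " ")


def fetch_page(query):
    """Send one query to the endpoint and return the sitelink URLs."""
    params = urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "people-titles-sketch/0.1",  # hypothetical name
        },
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        data = json.load(response)
    return [row["sitelink"]["value"] for row in data["results"]["bindings"]]


def iter_titles(gender, fetch=fetch_page, page_size=PAGE_SIZE):
    """Yield article titles page by page until a short page is returned."""
    offset = 0
    while True:
        links = fetch(build_query(GENDERS[gender], page_size, offset))
        for link in links:
            yield title_from_sitelink(link)
        if len(links) < page_size:
            break
        offset += page_size
```

Calling `iter_titles("male")` or `iter_titles("female")` then yields one title at a time, which can be written to a file per gender. The `fetch` parameter is only there so the paging logic can be tried out without hitting the network.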
CennoxX