
I'm using the Python wikipedia library to get the list of categories of a page. As far as I can see, it's a wrapper around the MediaWiki API.

Anyway, I'm wondering how to generalize the categories to macro categories, like these Main topic classifications.

For example, if I look up the page Hamburger there is a category called German-American cuisine, but I would like to get its super category, like Food and drink. How can I do that?

import wikipedia
page = wikipedia.page("Hamburger")
print(page.categories)
# how to filter only imortant categories?
>>>['All articles with specifically marked weasel-worded phrases', 'All articles with unsourced statements', 'American sandwiches', 'Articles with hAudio microformats', 'Articles with short description', 'Articles with specifically marked weasel-worded phrases from May 2015', 'Articles with unsourced statements from May 2017', 'CS1: Julian–Gregorian uncertainty', 'Commons category link is on Wikidata', 'Culture in Hamburg', 'Fast food', 'German-American cuisine', 'German cuisine', 'German sandwiches', 'Hamburgers (food)', 'Hot sandwiches', 'National dishes', 'Short description is different from Wikidata', 'Spoken articles', 'Use mdy dates from October 2020', 'Webarchive template wayback links', 'Wikipedia articles with BNF identifiers', 'Wikipedia articles with GND identifiers', 'Wikipedia articles with LCCN identifiers', 'Wikipedia articles with NARA identifiers', 'Wikipedia indefinitely move-protected pages', 'Wikipedia pages semi-protected against vandalism']

I didn't find an API for walking the hierarchical tree of Wikipedia categories.

I accept both Python and API requests solutions. Thank you

EDIT: I have found the categorytree API, which seems to do something similar to what I need.


Anyway, I didn't find a way to pass the options parameter described in the documentation. I think the options can be those listed in this link, like mode=parents, but I can't work out how to put this parameter in the HTTP URL, because it must be a JSON object, as the documentation says. I was trying this: https://en.wikipedia.org/w/api.php?action=categorytree&category=Category:Biscuits&format=json. How do I insert the options field?

Paolo Magnani
  • `categorytree` is an old and ugly API that was meant for the specific purpose of rendering a category tree in the UI. You are probably better off with [`categories`](https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bcategories) or the [categorylinks dump](https://meta.wikimedia.org/wiki/Data_dumps/What%27s_available_for_download#Database_tables). – Tgr Jan 21 '21 at 02:44
  • May you define "important"? – logi-kal Jan 22 '21 at 11:29
  • @horcrux I don't see where I wrote "important". If you refer to the search of more general categories, my purpose should be finding the highest parent categories to generalize the category of each Wikipedia page. An example of a taxonomy I would like is https://en.wikipedia.org/wiki/Category:Main_topic_classifications – Paolo Magnani Jan 22 '21 at 11:38
  • You said "how to filter only imortant categories?" (there is a typo in "imortant"). So, to define your problem better: given a category X, you want to get a category Y among the ones in "Category:Main topic classifications" such that X is contained in Y. Am I right? – logi-kal Jan 22 '21 at 11:46
  • You are right @horcrux. That's exactly my purpose :) – Paolo Magnani Jan 22 '21 at 16:29

2 Answers


This is a very hard task, since Wikipedia's category graph is a mess (technically speaking :-)). Indeed, in a tree you would expect to get to the root node in logarithmic time. But this is not a tree, since any node can have multiple parents!

Furthermore, I think that it can't be accomplished only using categories, because, as you can see in the example, you are very likely going to get unexpected results. Anyway I tried to reproduce something similar to what you asked.

Explanation of the code below:

  • Start from a source page (the hardcoded one is "Hamburger");
  • Walk upward, recursively visiting all the parent categories;
  • Cache every category already met, to avoid visiting the same category twice (this also solves the problem of cycles);
  • Cut the current branch if you find a target category;
  • Stop when the backlog is empty.

Starting from a given page you are likely to reach more than one target category, so I organized the result as a dictionary that tells you how many times each target category has been met.

As you may imagine, the response is not immediate, so this algorithm is better run offline. It can also be improved in many ways (see below).

The code

import requests
import time
import wikipedia

def get_categories(title) :
    try : return set(wikipedia.page(title, auto_suggest=False).categories)
    except requests.exceptions.ConnectionError :
        time.sleep(10)
        return get_categories(title)

start_page = "Hamburger"
target_categories = {"Academic disciplines", "Business", "Concepts", "Culture", "Economy", "Education", "Energy", "Engineering", "Entertainment", "Entities", "Ethics", "Events", "Food and drink", "Geography", "Government", "Health", "History", "Human nature", "Humanities", "Knowledge", "Language", "Law", "Life", "Mass media", "Mathematics", "Military", "Music", "Nature", "Objects", "Organizations", "People", "Philosophy", "Policy", "Politics", "Religion", "Science and technology", "Society", "Sports", "Universe", "World"}
result_categories = {c:0 for c in target_categories}    # dictionary target category -> number of paths
cached_categories = set()       # monotonically increasing
backlog = get_categories(start_page)
cached_categories.update(backlog)
while (len(backlog) != 0) :
    print("\nBacklog size: %d" % len(backlog))
    cat = backlog.pop()         # pick a category removing it from backlog
    print("Visiting category: " + cat)
    try:
        for parent in get_categories("Category:" + cat) :
            if parent in target_categories :
                print("Found target category: " + parent)
                result_categories[parent] += 1
            elif parent not in cached_categories :
                backlog.add(parent)
                cached_categories.add(parent)
    except KeyError: pass       # current cat may not have "categories" attribute
result_categories = {k:v for (k,v) in result_categories.items() if v>0} # filter not-found categories
print("\nVisited categories: %d" % len(cached_categories))
print("Result: " + str(result_categories))

Results for your example

In your example, the script would visit 12176 categories (!) and would return the following result:

{'Education': 21, 'Society': 40, 'Knowledge': 17, 'Entities': 4, 'People': 21, 'Health': 25, 'Mass media': 25, 'Philosophy': 17, 'Events': 17, 'Music': 18, 'History': 21, 'Sports': 6, 'Geography': 18, 'Life': 13, 'Government': 36, 'Food and drink': 12, 'Organizations': 16, 'Religion': 23, 'Language': 15, 'Engineering': 7, 'Law': 25, 'World': 13, 'Military': 18, 'Science and technology': 8, 'Politics': 24, 'Business': 15, 'Objects': 3, 'Entertainment': 15, 'Nature': 12, 'Ethics': 12, 'Culture': 29, 'Human nature': 3, 'Energy': 13, 'Concepts': 7, 'Universe': 2, 'Academic disciplines': 23, 'Humanities': 25, 'Policy': 14, 'Economy': 17, 'Mathematics': 10}

As you may notice, the "Food and drink" category has been reached only 12 times, while, for instance, "Society" has been reached 40 times. This tells us a lot about how weird Wikipedia's category graph is.
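
Part of the noise may come from the starting set itself: it includes maintenance categories such as "All articles with unsourced statements", which are typically marked as hidden on English Wikipedia. The following is a minimal sketch, not the method used above, that asks the MediaWiki `query` API directly for non-hidden parent categories through `prop=categories` with `clshow=!hidden`. The helper names (`fetch_json`, `visible_parents`) and the User-Agent string are made up for illustration, and whether this filter actually changes the counts above has not been tested.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_json(params):
    """GET the API with the given parameters and decode the JSON reply."""
    req = Request(API_URL + "?" + urlencode(params),
                  headers={"User-Agent": "category-demo/0.1 (example)"})  # placeholder UA
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)

def visible_parents(title, fetch=fetch_json):
    """Return the non-hidden categories of a page (an article or a 'Category:...' page)."""
    params = {"action": "query", "prop": "categories", "titles": title,
              "clshow": "!hidden", "cllimit": "max",
              "format": "json", "formatversion": "2"}
    parents = set()
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", []):
            for cat in page.get("categories", []):
                parents.add(cat["title"].split(":", 1)[1])  # drop the "Category:" prefix
        if "continue" not in data:
            return parents
        params = {**params, **data["continue"]}  # follow the API continuation
```

`visible_parents("Hamburger")` or `visible_parents("Category:Fast food")` could then stand in for `get_categories` in the script above; the `fetch` argument is there only so the function can be exercised without network access.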

Possible improvements

There are many ways to optimize or approximate this algorithm. The first that come to my mind:

  • Consider keeping track of the path length and suppose that the target category with the shortest path is the most relevant one.
  • Reduce the execution time:
    • You can reduce the number of steps by stopping the script after the first target category occurrence (or at the N-th occurrence).
    • If you run this algorithm starting from multiple articles, you can keep in memory which target categories are reachable from every category that you met. For example, after your "Hamburger" run you will know that starting from "Category:Fast food" you will get to "Category:Economy", and this can be valuable information. This will be expensive in terms of space, but will eventually help you reduce the execution time.
  • Use as labels only the most frequent target categories. E.g. if your result is {"Food and drink" : 37, "Economy" : 4}, you may want to keep only "Food and drink" as the label. To do this you can:
    • take the N most occurring target categories;
    • take the most relevant fraction (e.g. the first half, or third, or fourth);
    • take the categories which occur at least N% of the time w.r.t. the most frequent one;
    • use more sophisticated statistical tests for analyzing statistical significance of frequency.
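
Two of the ideas above (shortest path length, and keeping only labels that are frequent relative to the top one) can be sketched as follows. This is only an illustration under stated assumptions: `get_parents` is a hypothetical callable that returns the parent categories of a category (for example a wrapper around `get_categories("Category:" + cat)`), and the names `nearest_targets`, `keep_frequent`, `max_depth` and `min_ratio` are invented here.

```python
from collections import deque

def nearest_targets(start_categories, get_parents, targets, max_depth=10):
    """Breadth-first walk up the category graph.

    Returns {target category: length of the shortest path found}, where the
    categories of the start page count as depth 1.
    """
    depth_of = {}                          # target -> shortest distance found
    seen = set(start_categories)           # every category is queued at most once
    queue = deque((c, 1) for c in start_categories)
    while queue:
        cat, depth = queue.popleft()
        if cat in targets:
            depth_of.setdefault(cat, depth)
            continue                       # cut the branch at a target category
        if depth >= max_depth:
            continue                       # do not climb any further
        for parent in get_parents(cat):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return depth_of

def keep_frequent(counts, min_ratio=0.5):
    """Keep the targets whose count is at least min_ratio of the highest count."""
    if not counts:
        return {}
    top = max(counts.values())
    return {k: v for k, v in counts.items() if v >= min_ratio * top}
```

For instance, `keep_frequent({"Food and drink": 37, "Economy": 4})` keeps only "Food and drink", since 4 is below half of 37. Because the walk is breadth-first, the first time a target is dequeued is also its shortest distance from the start page.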
logi-kal
  • This is awesome! Thanks for this complete example! Only one question: if you do `for parent in get_categories("Category:" + cat) :` you are searching for the categories of the page of another category. Is this searching the parent categories or only the related categories? Because I thought to do something like this, but I thought the categories of a Category page aren't the parent categories, but only related ones. – Paolo Magnani Jan 25 '21 at 11:31
  • What do you mean by "related categories"? If page A (article or category) is contained in category B (i.e. it declares `[[Category:B]]` in its source code) then B is a parent category of A. – logi-kal Jan 25 '21 at 11:50
  • Ah ok! So you mean that if I search for a Category page and see its categories, those are its parent categories. I thought they were categories linked to that category, not necessarily parents! – Paolo Magnani Jan 25 '21 at 11:58
  • Try to do some tests ;-) Write [[Category:B]] in a category page A and look if A appears in B as subcategory. – logi-kal Jan 25 '21 at 12:04
  • Instead, if you write `[[:Category:B]]` (notice the colons) you are just linking it. See also: [Categorization](https://en.wikipedia.org/wiki/Wikipedia:Categorization) and [How to link to a category](https://en.wikipedia.org/w/index.php?title=Wikipedia:How_to_link_to_a_category) – logi-kal Jan 25 '21 at 12:07
  • You are right. It seemed trivial, but it wasn't :) So given this explanation, the approach of checking parent categories doesn't seem so good if **Hamburger** results in `Society` and not `Food and Drinks` – Paolo Magnani Jan 25 '21 at 15:35

Something a bit different you can do is get the machine-predicted article topic, with a query like https://ores.wikimedia.org/v3/scores/enwiki/?models=articletopic&revids=1000459607
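
A minimal sketch of how such a query could be issued from Python with the standard library only, assuming the endpoint is still served. The response layout used in `top_topics` (`enwiki` → `scores` → revision id → `articletopic` → `score` → `probability`) is an assumption about the JSON that comes back, and the function names and User-Agent string are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ORES_URL = "https://ores.wikimedia.org/v3/scores/enwiki/"

def articletopic_url(rev_id):
    """Build the query URL shown above for a given revision id."""
    return ORES_URL + "?" + urlencode({"models": "articletopic", "revids": rev_id})

def fetch_json(url):
    """GET a URL and decode the JSON reply."""
    req = Request(url, headers={"User-Agent": "category-demo/0.1 (example)"})  # placeholder UA
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)

def top_topics(response, rev_id, n=3):
    """Return the n most probable topic labels (response layout is assumed)."""
    score = response["enwiki"]["scores"][str(rev_id)]["articletopic"]["score"]
    probs = score["probability"]
    return sorted(probs, key=probs.get, reverse=True)[:n]
```

Usage would be along the lines of `top_topics(fetch_json(articletopic_url(1000459607)), 1000459607)`. Note that the model scores a revision id, not a page title, so the current revision of the page has to be looked up first.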

Tgr
  • This is really interesting. Do you think my purpose is impossible to achieve with the existing API? – Paolo Magnani Jan 20 '21 at 09:47
  • Not impossible but harder than you probably expect. Categories in MediaWiki are not a tree, or even a DAG, and there can be a huge amount of them on large wikis, so you'd have to do some kind of heuristic graph traversal, or download and locally preprocess the whole category graph. – Tgr Jan 21 '21 at 02:36