This is a very hard task, since Wikipedia's category graph is a mess (technically speaking :-)). Indeed, in a tree you would expect to get to the root node in logarithmic time. But this is not a tree, since any node can have multiple parents!
Furthermore, I think it can't be accomplished using categories alone, because, as you can see in your example, you are very likely to get unexpected results. Anyway, I tried to reproduce something similar to what you asked for.
Explanation of the code below:
- Start from a source page (the hardcoded one is "Hamburger");
- Go back, recursively visiting all the parent categories;
- Cache every category you meet, to avoid visiting the same category twice (this also solves the cycle problem);
- Cut the current branch if you find a target category;
- Stop when the backlog is empty.
Starting from a given page you are likely to reach more than one target category, so I organized the result as a dictionary that tells you how many times each target category has been met.
As you may imagine, the response is not immediate, so this algorithm is meant to be run offline. And it can be improved in many ways (see below).
The code
import requests
import time
import wikipedia

def get_categories(title):
    """Return the set of categories of a page, retrying on connection errors."""
    try:
        return set(wikipedia.page(title, auto_suggest=False).categories)
    except requests.exceptions.ConnectionError:
        time.sleep(10)
        return get_categories(title)

start_page = "Hamburger"
target_categories = {"Academic disciplines", "Business", "Concepts", "Culture", "Economy", "Education", "Energy", "Engineering", "Entertainment", "Entities", "Ethics", "Events", "Food and drink", "Geography", "Government", "Health", "History", "Human nature", "Humanities", "Knowledge", "Language", "Law", "Life", "Mass media", "Mathematics", "Military", "Music", "Nature", "Objects", "Organizations", "People", "Philosophy", "Policy", "Politics", "Religion", "Science and technology", "Society", "Sports", "Universe", "World"}

result_categories = {c: 0 for c in target_categories}  # target category -> number of paths that reached it
cached_categories = set()                               # already-seen categories, monotonically increasing

backlog = get_categories(start_page)  # categories still to be visited
cached_categories.update(backlog)

while backlog:
    print("\nBacklog size: %d" % len(backlog))
    cat = backlog.pop()  # pick a category, removing it from the backlog
    print("Visiting category: " + cat)
    try:
        for parent in get_categories("Category:" + cat):
            if parent in target_categories:
                print("Found target category: " + parent)
                result_categories[parent] += 1  # count the hit and cut this branch
            elif parent not in cached_categories:
                backlog.add(parent)
                cached_categories.add(parent)
    except KeyError:
        pass  # the current category page may not have a "categories" attribute

# filter out the target categories that were never reached
result_categories = {k: v for (k, v) in result_categories.items() if v > 0}

print("\nVisited categories: %d" % len(cached_categories))
print("Result: " + str(result_categories))
Results for your example
In your example, the script would visit 12176 categories (!) and would return the following result:
{'Education': 21, 'Society': 40, 'Knowledge': 17, 'Entities': 4, 'People': 21, 'Health': 25, 'Mass media': 25, 'Philosophy': 17, 'Events': 17, 'Music': 18, 'History': 21, 'Sports': 6, 'Geography': 18, 'Life': 13, 'Government': 36, 'Food and drink': 12, 'Organizations': 16, 'Religion': 23, 'Language': 15, 'Engineering': 7, 'Law': 25, 'World': 13, 'Military': 18, 'Science and technology': 8, 'Politics': 24, 'Business': 15, 'Objects': 3, 'Entertainment': 15, 'Nature': 12, 'Ethics': 12, 'Culture': 29, 'Human nature': 3, 'Energy': 13, 'Concepts': 7, 'Universe': 2, 'Academic disciplines': 23, 'Humanities': 25, 'Policy': 14, 'Economy': 17, 'Mathematics': 10}
As you may notice, the "Food and drink" category has been reached only 12 times, while, for instance, "Society" has been reached 40 times. This tells us a lot about how weird Wikipedia's category graph is.
Possible improvements
There are many possible improvements for optimizing or approximating this algorithm. The first ones that come to my mind:
- Consider keeping track of the path length, and assume that the target category reached with the shortest path is the most relevant one (see the first sketch after this list).
- Reduce the execution time:
- You can reduce the number of steps by stopping the script after the first target category occurrence (or at the N-th occurrence).
- If you run this algorithm starting from multiple articles, you can keep in memory a map that associates every category you have met with the target categories it leads to. For example, after your "Hamburger" run you will know that starting from "Category:Fast food" you will get to "Category:Economy", and this can be precious information. This will be expensive in terms of space, but it will eventually help you reduce the execution time (the second sketch after this list shows the storage side of this idea).
- Use as labels only the target categories that are most frequent. E.g. if your result is {"Food and drinks" : 37, "Economy" : 4}, you may want to keep only "Food and drinks" as the label. To do this you can:
- take the N most frequent target categories;
- take the most relevant fraction (e.g. the first half, or third, or fourth);
- take the categories which occur at least N% as often as the most frequent one (see the last sketch after this list);
- use more sophisticated statistical tests to analyze the statistical significance of the frequencies.
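For the path-length idea, here is a minimal sketch of how the traversal could be adapted (a sketch only: shallowest_targets and first_depth are names I made up, and it reuses the get_categories helper from above):

from collections import deque

def shallowest_targets(start_page, target_categories):
    # Breadth-first variant that records the depth at which each target
    # category is first reached; the shallowest one is presumably the
    # most relevant label.
    first_depth = {}  # target category -> depth of the first hit
    visited = set()
    queue = deque((c, 1) for c in get_categories(start_page))
    visited.update(c for c, _ in queue)
    while queue:
        cat, depth = queue.popleft()  # FIFO pop makes the visit breadth-first
        if cat in target_categories:
            first_depth.setdefault(cat, depth)
            continue  # cut the branch at a target category
        try:
            for parent in get_categories("Category:" + cat):
                if parent not in visited:
                    visited.add(parent)
                    queue.append((parent, depth + 1))
        except KeyError:
            pass  # same caveat as above: no "categories" attribute
    return first_depth

This also gives a natural place to stop early after the first (or N-th) target hit, as suggested above: just return as soon as first_depth is non-empty (or has N entries).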
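For the multi-article idea, this is only the storage side of it (the file name and the category_to_targets dictionary are my own assumptions; filling the dictionary correctly also requires propagating the targets found below each category back up to its ancestors, which the simple loop above does not do):

import os
import pickle

CACHE_FILE = "category_to_targets.pkl"  # assumed file name

def load_cache():
    # category -> set of target categories it is known to lead to
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, "rb") as f:
            return pickle.load(f)
    return {}

def save_cache(category_to_targets):
    with open(CACHE_FILE, "wb") as f:
        pickle.dump(category_to_targets, f)

In the main loop you would then check this dictionary before expanding a category and, on a hit, credit its known targets directly instead of visiting it again.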
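And for the threshold heuristic on the final labels, a tiny illustrative helper (the 50% default is arbitrary):

def filter_labels(result_categories, fraction=0.5):
    # Keep only the target categories that occur at least `fraction` times
    # as often as the most frequent one.
    if not result_categories:
        return {}
    top = max(result_categories.values())
    return {c: n for c, n in result_categories.items() if n >= fraction * top}

# e.g. filter_labels({"Food and drinks": 37, "Economy": 4}) -> {"Food and drinks": 37}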