I'm trying to get a "category tree" from Wikipedia for a project I'm working on. The problem is that I only want the more common topics and fields of study, and the larger dumps I've been able to find include far too many peripheral articles.

I recently found the vital articles pages, which seem to be a collection of exactly what I'm looking for. Unfortunately, I don't know how to extract the information from those pages, or how to filter the larger dumps so that they include only those categories and articles.

To be explicit, my question is: given a vital article level (say, level 4), how can I extract the tree of categories and article names for a given list (e.g. People, Arts, Physical sciences) into a CSV or similar file that I can then import into another program? I don't need the actual content of the articles, just the name (and ideally a reference to the article so I can get more information later).

I'm also open to suggestions about how to better accomplish this task.

Thanks!

M.K.

1 Answer

Have you tried PetScan? It's a Wikimedia-based tool that lets you extract data from pages based on a set of conditions.

To do this, open the tool, go to the "Templates&links" tab, and type the page name into the "Linked from All of these pages:" field, e.g. Wikipedia:Vital_articles/Level/4/History. If you want to add more than one page to the textarea, enter one page per line.

Finally, press the "Do it!" button and the data will be generated. You can then download it from the "Output" tab.
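If you would rather script this than click through the form, here is a rough sketch of the same idea. Note that it does not call PetScan: it asks the MediaWiki Action API (`action=query` with `prop=links`) for the article-namespace links on a list page and writes them to CSV. The helper names (`build_params`, `links_from_response`, `fetch_links`, `article_url`, `write_csv`), the CSV column names, and the User-Agent string are all made up for illustration. It only produces a flat list of linked articles per list page, not the section headings within the page.

```python
import csv
import io
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_params(page_title, cont=None):
    """Query parameters asking for the main-namespace links on one page."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": page_title,
        "plnamespace": "0",   # articles only
        "pllimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    if cont:
        # The API returns a "continue" object; merging it into the next
        # request fetches the next batch of links.
        params.update(cont)
    return params


def links_from_response(data):
    """Pull the linked titles out of one parsed API response."""
    titles = []
    for page in data.get("query", {}).get("pages", []):
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles


def fetch_links(page_title):
    """Follow continuation until every link on the page has been collected."""
    titles, cont = [], None
    while True:
        url = API_URL + "?" + urllib.parse.urlencode(build_params(page_title, cont))
        # Hypothetical User-Agent; replace with something identifying your project.
        req = urllib.request.Request(
            url, headers={"User-Agent": "vital-articles-sketch/0.1 (example)"}
        )
        with urllib.request.urlopen(req) as resp:
            data = json.load(resp)
        titles.extend(links_from_response(data))
        cont = data.get("continue")
        if not cont:
            return titles


def article_url(title):
    """Build a link back to the article so it can be looked up later."""
    return "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title.replace(" ", "_"))


def write_csv(source_page, titles, out):
    """Write one row per article: the list page, the title, and its URL."""
    writer = csv.writer(out)
    writer.writerow(["list_page", "article", "url"])
    for title in titles:
        writer.writerow([source_page, title, article_url(title)])
```

To use it, something like `write_csv(page, fetch_links(page), open("history.csv", "w", newline=""))` with `page = "Wikipedia:Vital_articles/Level/4/History"` should produce one file per list page; repeat for each list you care about.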

ASammour
  • Thanks! I had been using PetScan but I couldn't figure out the right format for the query. – M.K. Nov 06 '18 at 04:39