2

I recently found that Wikipedia has WikiProjects that are categorised by discipline (https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline). As shown in the link, there are 34 disciplines.

I would like to know if it is possible to get all the Wikipedia articles that are related to each of these disciplines.

For example, consider WikiProject Computer science. Is it possible to get all the computer-science-related Wikipedia articles using the WikiProject Computer science category? If so, are there any data dumps for it, or is there any other way to obtain this data?

I am currently using Python (i.e. pywikibot and pymediawiki). However, I am happy to receive answers in other languages as well.
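
To show what I mean: with pywikibot I can list the members of a single category (a minimal sketch is below, assuming a configured user-config.py), but I am not sure how to go from a WikiProject to its full list of articles:

import pywikibot

# Connect to English Wikipedia (assumes a configured user-config.py)
site = pywikibot.Site('en', 'wikipedia')

# The tracking category of one WikiProject, as an example
category = pywikibot.Category(site, 'Category:WikiProject_Computer_science_articles')

# Print the title of every member page of the category
for page in category.members():
    print(page.title())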

I am happy to provide more details if needed.

EmJ
  • Maybe you can use the Wikipedia API to get the resources that you want from Wikipedia; check out [this link](https://www.mediawiki.org/wiki/API:Main_page) – Ali Feb 17 '19 at 04:33
  • @AliCSE Thank you for the comment. I could not figure out how the MediaWiki API can be used to accomplish this task. Do you have any suggestions? :) – EmJ Feb 17 '19 at 05:06
  • How do you want the content? Is `html` format OK? If yes, I can write the code for that using Selenium or some API libraries to fetch the article, but the document styling may not be preserved. – Ali Feb 17 '19 at 12:47
  • @AliCSE Thank you very much for the comment. Given the link of a WikiProject (e.g., for computer science it is https://en.wikipedia.org/wiki/Category:WikiProject_Computer_science_articles), I want to get `only the names of the pages` in it (the above link has 7,186 pages in total), i.e. `Talk:.dbf, Talk:.onion, Talk:(1+ε)-approximate nearest neighbor search, Talk:/bin, Talk:/bin/bash, ......` etc. Please let me know your thoughts. I am happy to provide more details if needed. Thank you once again :) – EmJ Feb 17 '19 at 13:19
  • I have added the code, but in JavaScript. You can use it as a reference and fetch the data using a program of your choice. Let me know if you have any doubts about it... Thank you... – Ali Feb 17 '19 at 16:31

3 Answers

3

As I suggested, and adding to @arash's answer, you can use the Wikipedia API to get the Wikipedia data. Here is the documentation describing how to do that: API:Categorymembers#GET_request

As you commented that you need to fetch the data programmatically, below is sample code in JavaScript. It fetches the first 500 names from Category:WikiProject_Computer_science_articles and displays them as output. You can port this example to the language of your choice:

// Importing the module
const fetch = require('node-fetch');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});

To write the data into a file, you can do it like below:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = [];
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});

The above stores the data comma-separated, because we write a JavaScript array to the file. If you want each title on its own line, without commas, you need to do it like this:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

//URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=ids%7Ctitle&cmlimit=500";

//Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    let len = t.query.categorymembers.length;
    // Initializing an empty array
    let titles = '';
    // Iterating over all the response data
    for(let i=0;i<len;i++) {
        // Printing the names
        let title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});

Because of the cmlimit cap, we can't fetch more than 500 titles per request, so we need to use cmcontinue to check for and fetch the next pages...

Try the code below, which fetches all the titles of a particular category, prints them, and appends the data to a file:

//Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');
//URL with resources to fetch
var url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Method to fetch and append the data to a file 
var fetchTheData = async (url) => {
    return await fetch(url).then(res => res.json()).then(data => {
        // Getting the length of the returned array
        let len = data.query.categorymembers.length;
        // Initializing an empty string
        let titles = '';
        // Iterating over all the response data
        for(let i=0;i<len;i++) {
            // Printing the names
            let title = data.query.categorymembers[i].title;
            console.log(title);
            titles += title + "\n";
        }
        // Appending to the file
        fs.appendFileSync('pathtotitles\\titles.txt', titles);
        // Returning the continuation token, or nothing once the last page is reached
        try {
            return data.continue.cmcontinue;
        } catch(err) {
            console.log("===>>> Finished Fetching...");
            return null;
        }
    });
}

// Method which constructs the next page URL with the continuation token and fetches it
var constructNextPageURL = async (url) => {
    // Getting the first continuation token
    let nextPage = await fetchTheData(url);
    // Keep fetching until the API stops returning a continuation token
    while (nextPage) {
        console.log("=> The next page URL is : " + (url + '&cmcontinue=' + nextPage));
        // Constructing the next page URL with the token and sending the fetch request
        nextPage = await fetchTheData(url + '&cmcontinue=' + nextPage);
    }
}

// Calling to begin extraction
constructNextPageURL(url);

I hope it helps...

Ali
  • Thanks a lot. I will run this code and let you know how it performed :) – EmJ Feb 17 '19 at 22:34
  • Please let me know. If the above doesn't work, then I will try to implement a solution in your `python` language... – Ali Feb 18 '19 at 04:31
  • Thanks a lot for your comment. However, I still could not run your code (as I am new to JS and need to read up on how to set up the environment to run it). I will let you know if I can run your code in the next 2-3 hours (as I am in a lecture now) :) – EmJ Feb 18 '19 at 05:19
  • Sure, here is some info – download `node js` and `npm`, install `node-fetch`, and try to run the above code. – Ali Feb 18 '19 at 05:28
  • Thanks a lot. This is very helpful. I will try this and let you know :) – EmJ Feb 18 '19 at 06:42
  • Thank you very much. I could successfully run the code. Just wondering if it is possible to write the results into a text file without printing to the command prompt. Looking forward to hearing from you. Thank you once again :) – EmJ Feb 18 '19 at 08:00
  • Welcome... Yes, we can write data to the file using the NodeJS `fs` module. I have updated the code; check it and let me know if you need anything... You can comment out the printing part if you want to... – Ali Feb 18 '19 at 09:19
  • Hi, just one more question. If I needed to get the 7,559 results from this link, how should I change the above code? https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Computer_science&namespace=&pagename=&quality=&importance=&score=&limit=250&offset=1&sorta=Importance&sortb=Quality I look forward to hearing from you. Thank you :) – EmJ Feb 18 '19 at 23:01
  • Check the updated code and let me know whether this is what you want. Thank you... – Ali Feb 19 '19 at 05:18
  • Thanks a lot. This is impressive. One more question (as I am new to JS). If I want to change the URL to this: https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Computer_science&namespace=&pagename=&quality=&importance=&score=&limit=250&offset=1&sorta=Importance&sortb=Quality, what changes should I make to the above code? Looking forward to hearing from you. Thank you very much once again :) – EmJ Feb 19 '19 at 06:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/188655/discussion-between-ali-cse-and-emi). – Ali Feb 19 '19 at 08:21
  • I tried to use the API that you provided. However, it does not seem to return the `Article` list of the `Computer Science` WikiProject: https://tools.wmflabs.org/enwp10/cgi-bin/list2.fcgi?run=yes&projecta=Computer_science&namespace=&pagename=&quality=&importance=&score=&limit=250&offset=1&sorta=Importance&sortb=Quality Just wondering why that happens. Looking forward to hearing from you :) – EmJ Feb 22 '19 at 11:59
2

You can use API:Categorymembers to get the list of subcategories and pages. Set the "cmtype" parameter to "subcat" to get subcategories, and "cmnamespace" to "0" to get articles; see the sketch below.
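
For example, a minimal Python sketch of both parameters, using the requests library (a sketch only; the category name is the one from the question):

import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def category_members(category, cmtype='page', cmnamespace=None):
    # Yield the members of a category, following API continuation
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'categorymembers',
        'cmtitle': category,
        'cmtype': cmtype,          # 'subcat' for subcategories, 'page' for pages
        'cmlimit': 500,
    }
    if cmnamespace is not None:
        params['cmnamespace'] = cmnamespace   # 0 = articles, 1 = talk pages
    while True:
        data = requests.get(API_URL, params=params).json()
        yield from data['query']['categorymembers']
        if 'continue' not in data:
            break
        params.update(data['continue'])

# List the subcategories of the disciplines category from the question
for member in category_members('Category:WikiProjects by discipline', cmtype='subcat'):
    print(member['title'])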

Also, you can get the list from the database (the category hierarchy information is in the categorylinks table, and article information is in the page table).
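
If you have database access (e.g. a local import of the dumps, or the Toolforge replicas), the join might look roughly like the sketch below; the connection parameters are hypothetical, and the schema is the standard MediaWiki one:

import pymysql

# Hypothetical connection details; adjust for your own setup
connection = pymysql.connect(host='localhost', user='reader',
                             password='secret', database='enwiki')

# cl_to holds the category title (no 'Category:' prefix, spaces as underscores);
# cl_from is the page_id of the member page; namespace 1 = talk pages
query = """
    SELECT page.page_title
    FROM page
    JOIN categorylinks ON categorylinks.cl_from = page.page_id
    WHERE categorylinks.cl_to = 'WikiProject_Computer_science_articles'
      AND page.page_namespace = 1
"""

with connection.cursor() as cursor:
    cursor.execute(query)
    for (title,) in cursor.fetchall():
        # page_title is stored as binary, so decode it
        print(title.decode('utf-8'))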

Arash
  • Thank you very much for the answer. It would be really great if you could show me how to do this using code (as I am still trying to figure out how to use the API and the database). Looking forward to hearing from you. Thank you once again :) – EmJ Feb 17 '19 at 11:46
  • Should `cmtitle` be `WikiProject Computer science‎`? :) – EmJ Feb 17 '19 at 12:00
  • I think you should check the PetScan tool on Meta. PetScan can list pages in category trees, with specific templates, or with links from/to specific pages: https://petscan.wmflabs.org You can find the source code here: https://bitbucket.org/magnusmanske/petscan – Arash Feb 19 '19 at 17:09
0

Came across this page in my Google results; I am leaving some working code here for posterity. It interacts with Wikipedia's API directly and doesn't use pywikibot or pymediawiki.

Getting the article names is a two-step process, because the members of the category are not the articles themselves but their talk pages. So first we get the talk pages, and then we get their parent pages, the actual articles.

(For more info on the parameters used in the API requests, check the pages for querying category members, and querying page info.)

import time
import requests
from datetime import datetime,timezone
import json

utc_time_now = datetime.now(timezone.utc)
utc_time_now_string =\
utc_time_now.replace(microsecond=0).replace(tzinfo=None).isoformat() + 'Z'

api_url = 'https://en.wikipedia.org/w/api.php'
headers = {'User-Agent': '<Your purpose>, owner_name: <Your name>, email_id: <Your email id>'}
# or you can follow the instructions at
# https://www.mediawiki.org/wiki/API:Etiquette#The_User-Agent_header

category = "Category:WikiProject_Computer_science_articles"

combined_category_members = []

params = {
        'action': 'query',
        'format': 'json',
        'list':'categorymembers',
        'cmtitle': category,
        'cmprop': 'ids|title|timestamp',
        'cmlimit': 500,
        'cmstart': utc_time_now_string,
        # you can also put a 'cmend': '20210101000000' 
        # (that YYYYMMDDHHMMSS string stands for 12 am UTC on Jan 1, 2021)
        # this then gathers category members added from now till value for 'cmend'
        'cmdir': 'older',
        'cmnamespace': '0|1',
        'cmsort': 'timestamp'
}

response = requests.get(api_url, headers=headers, params=params)
data = response.json()
category_members = data['query']['categorymembers']
combined_category_members.extend(category_members)

while 'continue' in data:
    params.update(data['continue'])
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    category_members = data['query']['categorymembers']
    combined_category_members.extend(category_members)

#now we've gotten only the talk page ids so far
#now we have to get the parent page ids from talk page ids

final_dict = {}

talk_page_id_list = []
for member in combined_category_members:
    talk_page_id = member['pageid']
    talk_page_id_list.append(talk_page_id)

while talk_page_id_list: #while not an empty list
    fifty_pageid_batch = talk_page_id_list[0:50]
    fifty_pageid_batch_converted = [str(number) for number in fifty_pageid_batch]
    fifty_pageid_string = '|'.join(fifty_pageid_batch_converted)
    params = {
            'action':   'query',
            'format':   'json',
            'prop':     'info',
            'pageids':  fifty_pageid_string,
            'inprop': 'subjectid|associatedpage'
            }
    time.sleep(1)
    response = requests.get(api_url, headers=headers, params=params)
    data = response.json()
    for talk_page_id, talk_page_id_dict in data['query']['pages'].items():
        page_id_raw = talk_page_id_dict['subjectid']
        page_id = str(page_id_raw)
        page_title = talk_page_id_dict['associatedpage']
        final_dict[page_id] = page_title

    del talk_page_id_list[0:50] 

with open('comp_sci_category_members.json', 'w', encoding='utf-8') as filex:
    json.dump(final_dict, filex, ensure_ascii=False)