I successfully wrote the following code to get the titles of the pages in a Wikipedia category. The category contains more than 404 titles, but my output file has only 200. How can I extend the code to follow the category's "next page" link and collect all the titles?

Command: python3 getCATpages.py

Code of getCATpages.py:

from bs4 import BeautifulSoup
import requests
import csv

# getting all the contents of the category URL
url = 'https://en.wikipedia.org/wiki/Category:Free software'
content = requests.get(url).content
soup = BeautifulSoup(content, 'lxml')

# showing the category-pages summary
catpages = soup.find(id='mw-pages')
catPageSummary = catpages.find('p')
print(catPageSummary.text)

# giving serial numbers to the output print and limiting the print to three
links = catpages.find_all('a')
for counter, link in enumerate(links[:3], start=1):
    print('        ' + str(counter) + '  ' + link.text)

# getting the category page titles
things_to_write = [li.find('a').get('title') for li in catpages.find_all('li')]

# writing the category page titles to the output file, one per row
with open('001-catPages.csv', 'a', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for title in things_to_write:
        writer.writerow([title])

2 Answers

MediaWiki shows at most 200 member pages per category listing page, so the idea is to follow the "next page" link until there is no such link left. We'll maintain a web-scraping session while making multiple requests, collecting the desired link titles in a list:

from pprint import pprint
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests


base_url = 'https://en.wikipedia.org/wiki/Category:Free software'


def get_next_link(soup):
    return soup.find("a", text="next page")

def extract_links(soup):
    return [a['title'] for a in soup.select("#mw-pages li a")]


with requests.Session() as session:
    content = session.get(base_url).content
    soup = BeautifulSoup(content, 'lxml')

    links = extract_links(soup)
    next_link = get_next_link(soup)
    while next_link is not None:  # while there is a Next Page link
        url = urljoin(base_url, next_link['href'])
        content = session.get(url).content
        soup = BeautifulSoup(content, 'lxml')

        links += extract_links(soup)

        next_link = get_next_link(soup)

pprint(links)

Prints:

['Free software',
 'Open-source model',
 'Outline of free software',
 'Adoption of free and open-source software by public institutions',
 ...
 'ZK Spreadsheet',
 'Zulip',
 'Portal:Free and open-source software']

Omitted the irrelevant CSV writing part.
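
For completeness, here is a minimal sketch of that omitted step, writing the collected links list to the 001-catPages.csv file named in the question, one title per row:

import csv

# write the collected titles to CSV, one per row (overwrites any existing file)
with open('001-catPages.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    for title in links:
        writer.writerow([title])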

  • Some maintenance categories contain more than a few lakh (several hundred thousand) pages. For example, [https://en.wikipedia.org/wiki/Category:Commons_category_with_local_link_same_as_on_Wikidata 288,935 pages]. To avoid server load, is it possible to set a 60-second interval between the next-page requests? – info-farmer Dec 30 '16 at 06:22
  • @info-farmer you would need to adjust the code to work through the next pages in sections. And, yes, it would be a good idea to add time delays so as not to hit Wikipedia too often. Good thinking, thanks. Also, see if Scrapy would help to solve this problem of navigating to next pages more easily. – alecxe Dec 30 '16 at 06:24
  • Excuse me! I am still practising English and typing as well as programming. I am contributing to the Tamil wiki, not the English wiki. The above code is very useful for us. If possible, please recode it with the time delay. – info-farmer Dec 30 '16 at 06:38
  • @alecxe Thanks, it's very informative. How would I do the same to extract all the subcategory titles in a single script, to a depth of 5 levels? For example, I am interested in extracting all the titles under `https://en.wikipedia.org/wiki/Category:Computer_science`. What I need is to extract all the subcategory names (along with all the recursive category names and their associated pages) for a specified category (say Computer_science), to at least 5 levels of depth. Thanks in advance. – M S Oct 06 '18 at 17:07
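
A minimal sketch of the 60-second delay requested in the comments above, applied to the answer's loop (the delay length is just the commenter's suggestion, not a requirement):

import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'https://en.wikipedia.org/wiki/Category:Free software'
DELAY_SECONDS = 60  # pause between requests to be gentle on the server

def get_next_link(soup):
    return soup.find("a", text="next page")

def extract_links(soup):
    return [a['title'] for a in soup.select("#mw-pages li a")]

with requests.Session() as session:
    soup = BeautifulSoup(session.get(base_url).content, 'lxml')
    links = extract_links(soup)
    next_link = get_next_link(soup)
    while next_link is not None:
        time.sleep(DELAY_SECONDS)  # wait before requesting the next page
        url = urljoin(base_url, next_link['href'])
        soup = BeautifulSoup(session.get(url).content, 'lxml')
        links += extract_links(soup)
        next_link = get_next_link(soup)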

The MediaWiki API provides a generator for doing this. Here is code, adapted from an example in the MediaWiki documentation, that exploits it.

import requests

def query(request):
    request['action'] = 'query'
    request['format'] = 'json'
    request['generator'] = 'categorymembers'
    request['gcmtype'] = 'subcat'
    previousContinue = {}
    while True:
        req = request.copy()
        req.update(previousContinue)
        result = requests.get('https://en.wikipedia.org/w/api.php', params=req).json()
        if 'error' in result:
            raise RuntimeError(result['error'])
        if 'warnings' in result:
            print(result['warnings'])
        if 'query' in result:
            yield result['query']
        # follow the API's continuation token until the listing is exhausted
        if 'continue' in result:
            previousContinue = {'gcmcontinue': result['continue']['gcmcontinue']}
        else:
            break

for result in query({'gcmtitle': 'Category:Free_software'}):
    print(result)

I feel justified in reworking fragmentary code that is presented elsewhere because I don't find the MediaWiki documentation entirely clear.

Here's the output from this script.

{'pages': {'42113821': {'pageid': 42113821, 'ns': 14, 'title': 'Category:Free software by type'}, '6702554': {'pageid': 6702554, 'ns': 14, 'title': 'Category:Free application software'}, '12180074': {'pageid': 12180074, 'ns': 14, 'title': 'Category:Free software by programming language'}, '6962224': {'pageid': 6962224, 'ns': 14, 'title': 'Category:Free software lists and comparisons'}, '39563179': {'pageid': 39563179, 'ns': 14, 'title': 'Category:Bitcoin'}, '34482991': {'pageid': 34482991, 'ns': 14, 'title': 'Category:Free-software awards'}, '30945256': {'pageid': 30945256, 'ns': 14, 'title': 'Category:Single-platform free software'}, '49967344': {'pageid': 49967344, 'ns': 14, 'title': 'Category:Free software by license'}, '6721544': {'pageid': 6721544, 'ns': 14, 'title': 'Category:Free system software'}, '34313543': {'pageid': 34313543, 'ns': 14, 'title': 'Category:Cross-platform free software'}}}
{'pages': {'39630972': {'pageid': 39630972, 'ns': 14, 'title': 'Category:Free and open-source Android software'}, '33751817': {'pageid': 33751817, 'ns': 14, 'title': 'Category:Copyleft'}, '40888749': {'pageid': 40888749, 'ns': 14, 'title': 'Category:Free and open-source software'}, '25128034': {'pageid': 25128034, 'ns': 14, 'title': 'Category:Open data'}, '5446650': {'pageid': 5446650, 'ns': 14, 'title': 'Category:Free software culture and documents'}, '7298930': {'pageid': 7298930, 'ns': 14, 'title': 'Category:Creative Commons'}, '21140817': {'pageid': 21140817, 'ns': 14, 'title': 'Category:Free communication software'}, '7457597': {'pageid': 7457597, 'ns': 14, 'title': 'Category:Software forks'}, '34474935': {'pageid': 34474935, 'ns': 14, 'title': 'Category:Free software distributions'}, '34482997': {'pageid': 34482997, 'ns': 14, 'title': 'Category:Free-software events'}}}
{'pages': {'34348162': {'pageid': 34348162, 'ns': 14, 'title': 'Category:Free and open-source software licenses'}, '703116': {'pageid': 703116, 'ns': 14, 'title': 'Category:Free software projects'}, '39630965': {'pageid': 39630965, 'ns': 14, 'title': 'Category:History of free and open-source software'}, '1358456': {'pageid': 1358456, 'ns': 14, 'title': 'Category:GNU Project software'}, '34313891': {'pageid': 34313891, 'ns': 14, 'title': 'Category:Free mobile software'}, '6687643': {'pageid': 6687643, 'ns': 14, 'title': 'Category:Free computer programming tools'}, '39401957': {'pageid': 39401957, 'ns': 14, 'title': 'Category:Open-source software hosting facilities'}, '38962158': {'pageid': 38962158, 'ns': 14, 'title': 'Category:Open-source robots'}, '21840815': {'pageid': 21840815, 'ns': 14, 'title': 'Category:Free multilingual software'}, '52773626': {'pageid': 52773626, 'ns': 14, 'title': 'Category:Open source artificial intelligence'}}}
{'pages': {'35912174': {'pageid': 35912174, 'ns': 14, 'title': 'Category:Free technical analysis software'}, '4530452': {'pageid': 4530452, 'ns': 14, 'title': 'Category:Free software stubs'}, '40516443': {'pageid': 40516443, 'ns': 14, 'title': 'Category:Works about free software'}, '49310608': {'pageid': 49310608, 'ns': 14, 'title': 'Category:Public-domain software with source code'}, '952642': {'pageid': 952642, 'ns': 14, 'title': 'Category:Public-domain software'}, '1819021': {'pageid': 1819021, 'ns': 14, 'title': 'Category:Free software websites'}, '46441720': {'pageid': 46441720, 'ns': 14, 'title': 'Category:Free software webmail'}, '36794168': {'pageid': 36794168, 'ns': 14, 'title': 'Category:Free speech synthesis software'}, '6643120': {'pageid': 6643120, 'ns': 14, 'title': 'Category:Free screen readers'}, '34403011': {'pageid': 34403011, 'ns': 14, 'title': 'Category:Open science'}}}
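
Note that gcmtype='subcat' restricts the generator to subcategories, which is why only Category: titles appear above. To list the member pages themselves, as the original question asks, the API accepts gcmtype='page'; here is a minimal sketch of a single request (continuation handling, as in query() above, would still be needed for large categories):

import requests

# fetch one batch of member *pages* (not subcategories) of the category
params = {
    'action': 'query',
    'format': 'json',
    'generator': 'categorymembers',
    'gcmtype': 'page',
    'gcmtitle': 'Category:Free_software',
}
result = requests.get('https://en.wikipedia.org/w/api.php', params=params).json()
for page in result['query']['pages'].values():
    print(page['title'])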