I have a simple scraping task whose pagination I would like to handle more efficiently, and whose per-page lists I would like to append together so that all of the scraped results end up in a single file.

The current task is scraping municipal laws for the city of São Paulo, iterating over the first 10 pages. I would like to find a way to determine the total number of pages for pagination, and have the script automatically cycle through all pages, similar in spirit to this: Handling pagination in lxml.

The xpaths for the pagination links are too poorly defined at the moment for me to see how to do this reliably. For instance, on the first and last pages (1 and 1608) there are only three li nodes, while on page 1605 there are six.

/html/body/div/section/ul[2]/li/a

How can I account for this pagination efficiently, determining the number of pages automatically rather than manually, and how should I specify the xpaths so that the script cycles through all of the appropriate pages without duplicates?
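Something like this is what I have in mind for detecting the page count, though it is untested and assumes the largest numeric pagination link on page 1 is the total page count:

import requests
from lxml import html

base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=%d&types=o"

# Fetch page 1 and pull the numbers out of the pagination links.
# Assumption: the highest number shown is the final page (e.g. 1608).
first = html.fromstring(requests.get(base_url % 1).text)
link_texts = first.xpath('/html/body/div/section/ul[2]/li/a/text()')
page_numbers = [int(t) for t in link_texts if t.strip().isdigit()]
total_pages = max(page_numbers) if page_numbers else 1
print(total_pages)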

The existing code is as follows:

#! /usr/bin/env python
# -*- coding: utf-8 -*-

import requests  
from lxml import html

base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=%d&types=o" 
for url in [base_url % i for i in xrange(10)]:
    page = requests.get(url)
    tree = html.fromstring(page.text)

    #This will create a list of titles:
    titles = tree.xpath('/html/body/div/section/ul/li/a/strong/text()')
    #This will create a list of descriptions:
    desc = tree.xpath('/html/body/div/section/ul/li/a/text()')
    #This will create a list of URLs
    url = tree.xpath('/html/body/div/section/ul/li/a/@href')

    print 'Titles: ', titles
    print 'Description: ', desc
    print 'URL: ', url

Secondly, how can I compile/append these results and write them out to JSON, SQL, etc.? I would prefer JSON out of familiarity, but I am unsure how to go about it at the moment.
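For example, I picture collecting one record per law across all pages and dumping the list at the end, roughly like this (untested, and assuming my three xpath lists line up row for row; laws.json is just a placeholder name):

import json
import requests
from lxml import html

base_url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=%d&types=o"
records = []

for i in range(1, 11):  # still the first 10 pages for now
    tree = html.fromstring(requests.get(base_url % i).text)
    titles = tree.xpath('/html/body/div/section/ul/li/a/strong/text()')
    descs = tree.xpath('/html/body/div/section/ul/li/a/text()')
    hrefs = tree.xpath('/html/body/div/section/ul/li/a/@href')
    # One dict per law, appended across pages so everything lands in one structure.
    for title, description, href in zip(titles, descs, hrefs):
        records.append({'title': title, 'description': description, 'url': href})

# Write the accumulated records to a single JSON file.
with open('laws.json', 'w') as f:
    json.dump(records, f, indent=2)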

DV Hughes

2 Answers

  1. You'll need to examine the data layout of your page/site; each site is different. Look for a 'pagination' or 'next' control (or a slider), extract the page count or next-page link, and use that in your loop, as sketched below.

  2. Import the json library. It has a json.dump function you can use to write the results to a file; the sketch below ends with one.
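A rough sketch of both points; the XPath for the next link is a guess from the question's markup (ul[2] holds the pagination links), and the "prox" match for the Próxima/next link is an assumption, so inspect the real page and adjust:

import json
import requests
from lxml import html
from urllib.parse import urljoin

url = "http://www.leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page=1&types=o"
titles = []
while url:
    tree = html.fromstring(requests.get(url).text)
    titles.extend(tree.xpath('/html/body/div/section/ul/li/a/strong/text()'))
    # Follow the next-page link if there is one; stop when it disappears.
    # "prox" here is a guess at the next-link label -- verify against the site.
    nxt = tree.xpath('/html/body/div/section/ul[2]/li/a[contains(text(), "prox")]/@href')
    url = urljoin(url, nxt[0]) if nxt else None

# json.dump writes the accumulated list out in one go.
with open('titles.json', 'w') as f:
    json.dump(titles, f)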

zevij

Although I couldn't fully understand your problem, this code should help kick-start a new attempt. It is written for Python 3.

import requests
from lxml import html

result = {}
base_url = "https://leismunicipais.com.br/legislacao-municipal/5298/leis-de-sao-paulo?q=&page={0}&types=28&types=5"
for url in [base_url.format(i) for i in range(1, 3)]:
    tree = html.fromstring(requests.get(url).text)
    for item in tree.cssselect(".item-result"):
        # Title of the law; fall back to an empty string if the node is missing.
        try:
            name = ' '.join(item.cssselect(".title a")[0].text.split())
        except Exception:
            name = ""

        # Link shown with the result (renamed from `url` so it does not
        # shadow the loop variable).
        try:
            link = ' '.join(item.cssselect(".domain")[0].text.split())
        except Exception:
            link = ""
        result[name] = link

print(result)

Partial output:

{'Decreto 57998/2017': 'http://leismunicipa.is', 'Decreto 58009/2017': 'http://leismunicipa.is'}
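To cover the JSON output asked about in the question, the same result dict can be written out with the standard json module (result.json is an arbitrary file name):

import json

with open('result.json', 'w') as f:
    json.dump(result, f, indent=2)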
SIM