
I would like to bulk download free-to-download PDFs (copies of an old newspaper called Gaceta, published from 1843 to 1900) from this website of the Nicaraguan National Assembly with Python3/Scrapy (see my former question here), using the script below:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# A scrapy script to download issues of the Gaceta de Nicaragua (1843-1961)

# virtualenv -p python3 envname
# source envname/bin/activate
# scrapy runspider gaceta_downloader.py

import errno
import json
import os

import scrapy
from scrapy import FormRequest, Request

pwd="/Downloads"
os.chdir(pwd) # this will change directory to pwd path.
print(os.getcwd())

class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'
    allowed_domains = ['asamblea.gob.ni']
    start_urls = ['http://digesto.asamblea.gob.ni/consultas/coleccion/']

    papers = {
        "Diario Oficial": "28",
    }

    def parse(self, response):

        for key, value in list(self.papers.items()):
            yield FormRequest(
                url='http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php',
                headers={'X-Requested-With': 'XMLHttpRequest'},
                formdata={
                    'hddQueryType': 'initgetRdds',
                    'cole': value,
                },
                meta={'paper': key},
                callback=self.parse_rdds,
            )
        pass

    def parse_rdds(self, response):
        data = json.loads(response.body_as_unicode())
        for r in data["rdds"]:
            r['paper'] = response.meta['paper']
            rddid = r['rddid']
            yield Request("http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=" + rddid,
                          callback=self.download_pdf, meta=r)

    def download_pdf(self, response):
       filename = "{paper}/{anio}/".format(**response.meta) + "{titulo}-{fecPublica}.pdf".format(**response.meta).replace("/", "_")
       if not os.path.exists(os.path.dirname(filename)):
           try:
               os.makedirs(os.path.dirname(filename))
           except OSError as exc:  # guard against race condition
               if exc.errno != errno.EEXIST:
                   raise

       with open(filename, 'wb') as f:
           f.write(response.body)

The script does its job, fetching the direct links from a PHP file and subsequently downloading the PDFs; however, there are two things still bugging me:

  1. I would like to be able to set the time range of the Gacetas to download, i.e. all available issues between 01/01/1844 and 01/01/1900. I tried to figure it out myself, to no avail, as I am a programming novice.
  2. I would like to speed up the script, maybe with xargs? For now it feels quite slow in execution, even though I have not measured it.
Til Hund
  • The site seems to be down now? – Tarun Lalwani May 16 '18 at 08:13
  • Hi Tarun Lalwani, one of the reasons I am doing this project is to save the Gaceta issues going back to their first issue, which is an important piece of cultural heritage. The country is facing an uprising right now and sporadic attacks on its servers. It could be that for a day or so the website might be down or under heavy load. Let's come back to it in a day. If the bounty runs out by then, I will renew it. – Til Hund May 16 '18 at 08:17
  • Understood, but it seems like the DNS is not working anymore. – Tarun Lalwani May 16 '18 at 08:21
  • As I am behind an intranet right now, I cannot `tracert` or `ping` to obtain the ip, however it seems that the site is down for good. (Yesterday it worked...) See [link](http://www.isitdownrightnow.com/digesto.asamblea.gob.ni.html). Let's just wait a day or two. – Til Hund May 16 '18 at 08:27
  • Now the website is up again, Tarun Lalwani. – Til Hund May 16 '18 at 18:08

1 Answer


Disclaimer: I did not test the script since scrapy requires Microsoft Visual C++ 14.0 and it takes a while to download and install :(

Here's an updated script: I added the date range as start and end, and modified the parse_rdds method so that it only downloads files within that time frame.

As for optimizing it: Scrapy is a non-blocking library and, as I understand it, it should already be able to download several files in parallel as it is right now. Keep in mind that you are downloading what seems to be a lot of files, so it could naturally take a while.
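
If it still feels slow, Scrapy's concurrency can be tuned through its settings before reaching for anything like xargs. A minimal sketch that could be merged into the spider below; the setting names are standard Scrapy settings, but the values are placeholders to experiment with, not tested recommendations for this site:

class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'

    # Illustrative values only; Scrapy's defaults are 16 and 8 respectively.
    custom_settings = {
        'CONCURRENT_REQUESTS': 32,             # total requests in flight
        'CONCURRENT_REQUESTS_PER_DOMAIN': 16,  # requests in flight per domain
        'DOWNLOAD_DELAY': 0.25,                # small delay to stay polite to the server
    }

The same settings can also be overridden per run on the command line, e.g. scrapy runspider gaceta_downloader.py -s CONCURRENT_REQUESTS=32.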

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# A scrapy script to download issues of the Gaceta de Nicaragua (1843-1961)

# virtualenv -p python3 envname
# source envname/bin/activate
# scrapy runspider gaceta_downloader.py

import errno
import json
import os
from datetime import datetime

import scrapy
from scrapy import FormRequest, Request

pwd="/Downloads"
os.chdir(pwd) # this will change directory to pwd path.
print(os.getcwd())


# date range, format DD/MM/YYYY
start = '16/01/1844'
end = '01/01/1900'

date_format = '%d/%m/%Y'
start = datetime.strptime(start, date_format)
end = datetime.strptime(end, date_format)


class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'
    allowed_domains = ['asamblea.gob.ni']
    start_urls = ['http://digesto.asamblea.gob.ni/consultas/coleccion/']

    papers = {
        "Diario Oficial": "28",
    }

    def parse(self, response):

        for key, value in list(self.papers.items()):
            yield FormRequest(
                url='http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php',
                headers={'X-Requested-With': 'XMLHttpRequest'},
                formdata={
                    'hddQueryType': 'initgetRdds',
                    'cole': value,
                },
                meta={'paper': key},
                callback=self.parse_rdds,
            )
        pass

    def parse_rdds(self, response):
        data = json.loads(response.body_as_unicode())
        for r in data["rdds"]:
            if not r['fecPublica']:
                continue

            r_date = datetime.strptime(r['fecPublica'], date_format)

            if start <= r_date <= end:
                r['paper'] = response.meta['paper']
                rddid = r['rddid']
                yield Request("http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=" + rddid,
                              callback=self.download_pdf, meta=r)

    def download_pdf(self, response):
       filename = "{paper}/{anio}/".format(**response.meta) + "{titulo}-{fecPublica}.pdf".format(**response.meta).replace("/", "_")
       if not os.path.exists(os.path.dirname(filename)):
           try:
               os.makedirs(os.path.dirname(filename))
           except OSError as exc:  # guard against race condition
               if exc.errno != errno.EEXIST:
                   raise

       with open(filename, 'wb') as f:
           f.write(response.body)
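
As a possible refinement, the hard-coded start and end could be accepted as spider arguments instead, so the date range can be changed without editing the script. A rough sketch using Scrapy's -a argument mechanism (the argument names start and end are just examples, and the datetime import from the script above is assumed); parse_rdds would then compare against self.start and self.end:

# Run with e.g.:
# scrapy runspider gaceta_downloader.py -a start=16/01/1844 -a end=01/01/1900
class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'

    def __init__(self, start='16/01/1844', end='01/01/1900', *args, **kwargs):
        super().__init__(*args, **kwargs)
        date_format = '%d/%m/%Y'
        # Parse the command-line strings into datetimes once, at spider start-up.
        self.start = datetime.strptime(start, date_format)
        self.end = datetime.strptime(end, date_format)
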
Armando Garza