
I would like to bulk download free-to-download PDFs (copies of an old newspaper called Gaceta, published from 1843 to 1900) from this website of the Nicaraguan National Assembly with Python 3/Scrapy.

I am an absolute beginner in programming and Python, but I tried to start with an (unfinished) script:

#!/usr/bin/env python3

from urllib.parse import urlparse
import scrapy

from scrapy.http import Request

class gaceta(scrapy.Spider):
    name = "gaceta"

    allowed_domains = ["digesto.asamblea.gob.ni"]
    start_urls = ["http://digesto.asamblea.gob.ni/consultas/coleccion/"]

    def parse(self, response):
        # Follow every link found in the results table
        for href in response.css('div#gridTableDocCollection::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        # Queue every PDF link of an issue for download
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf  # save_pdf is not written yet
            )

The link to each issue contains some gibberish, so the links cannot be anticipated; each one has to be searched for within the source code. See, for example, the links to the first four available issues of the said newspaper (a copy was not issued every day):

#06/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=nYgT5Rcvs2I%3D

#13/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=3sAxsKCA6Bo%3D

#28/07/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=137YSPeIXg8%3D

#08/08/1843
http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=aTvB%2BZpqoMw%3D

My problem is that I cannot get a working script together.

I would like my script to:

a) search for each PDF link within the table that appears after a search (called "tableDocCollection" in the website's source code). The actual link sits behind the "Acciones" button (XPath of the first issue: //*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[1]/a),

b) display the name of the issue it is downloading, which can also be found behind the "Acciones" button (XPath of the name of the first issue: //*[@id="tableDocCollection"]/tbody/tr[1]/td[5]/div/ul/li[2]/a); see the selector sketch after this list.
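If the result table were part of the plain HTML response, I imagine the two XPaths above could be generalized over the table rows and used as the parse callback of the spider above, roughly like this (just a sketch of what I am after, assuming the markup matches the quoted paths; it probably will not find anything as long as the table is loaded dynamically):

    def parse(self, response):
        # One <tr> per issue; inside each row the "Acciones" drop-down holds
        # the PDF link (first <li>) and the issue name (second <li>).
        for row in response.xpath('//*[@id="tableDocCollection"]/tbody/tr'):
            pdf_href = row.xpath('./td[5]/div/ul/li[1]/a/@href').extract_first()
            issue_name = row.xpath('./td[5]/div/ul/li[2]/a/text()').extract_first()
            if pdf_href:
                self.logger.info('Downloading issue: %s', issue_name)
                yield Request(response.urljoin(pdf_href),
                              callback=self.save_pdf,  # save_pdf still needs to be written
                              meta={'issue_name': issue_name})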

The major problems I run into when writing the script are:

1) the URL of the website does not change when I enter the search, so it seems I have to tell Scrapy to submit the appropriate search terms itself (check mark "Búsqueda avanzada", "Colección: Diario Oficial", "Medio de Publicación: La Gaceta", time interval 06/07/1843 to 31/12/1900); see the sketch after this list,

2) I do not know how to find each PDF link.
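From what I have read, the usual Scrapy approach when the URL never changes is to reproduce the POST request the browser sends when the search is submitted, something like the following (the endpoint and field names are just placeholders I made up; the real ones would have to be copied from the browser's network tab):

import scrapy
from scrapy import FormRequest

class GacetaSearchSpider(scrapy.Spider):
    name = "gaceta_search"
    start_urls = ["http://digesto.asamblea.gob.ni/consultas/coleccion/"]

    def parse(self, response):
        # Field names below are placeholders; copy the real ones from the
        # POST request shown in the browser's developer tools.
        yield FormRequest(
            "http://digesto.asamblea.gob.ni/consultas/coleccion/",  # placeholder endpoint
            formdata={
                "coleccion": "Diario Oficial",
                "medio": "La Gaceta",
                "desde": "06/07/1843",
                "hasta": "31/12/1900",
            },
            callback=self.parse_results,
        )

    def parse_results(self, response):
        pass  # parse the returned result table here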

How can I update the above script so that I can download all PDFs in the range 06/07/1843 to 31/12/1900?

Edit:

#!/usr/bin/env python3
from urllib.parse import urlparse
import scrapy

from scrapy import FormRequest
from scrapy.http import Request

# Payload tried in the Scrapy shell (the browser's JSON "null" becomes None in Python)
frmdata = {"rdds": [{"rddid": "+1RiQw3IehE=", "anio": "", "fecPublica": "",
                     "numPublica": "", "titulo": "", "paginicia": None,
                     "norma": None, "totalRegistros": "10"}]}
url = "http://digesto.asamblea.gob.ni/consultas/coleccion/"
r = FormRequest(url, formdata=frmdata)
fetch(r)  # fetch() is only available inside the Scrapy shell

yield FormRequest(url, callback=self.parse, formdata=frmdata)  # only valid inside a spider method
  • You would be better off using Python/Scrapy for this. Wget may not be the easiest option to get everything right. – Tarun Lalwani May 03 '18 at 18:54
  • Hi Tarun Lalwani, can you give me a code headstart for Scrapy? – Til Hund May 03 '18 at 22:18
  • @Tarun Lalwani, I took your advice to heart and started a `Scrapy` script, see above. – Til Hund May 04 '18 at 07:42
  • 1
  • If you look at the network traffic, there is a `POST` call to `proxy.php` with a code for the category https://i.stack.imgur.com/8fc2A.png, and the response contains the IDs for all years https://i.stack.imgur.com/rIhkI.png. You need to make this `proxy.php` POST with Scrapy in your code. – Tarun Lalwani May 04 '18 at 08:37
  • Thanks for commenting again, Tarun. I tried it with examples like [this](https://stackoverflow.com/questions/30342243/send-post-request-in-scrapy?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa), but I am not getting the script to work. :( – Til Hund May 04 '18 at 11:16
  • 1
  • Post your code and tell us what issues you are facing. – Tarun Lalwani May 04 '18 at 11:42

1 Answer

# -*- coding: utf-8 -*-
import errno
import json
import os

import scrapy
from scrapy import FormRequest, Request


class AsambleaSpider(scrapy.Spider):
    name = 'asamblea'
    allowed_domains = ['asamblea.gob.ni']
    start_urls = ['http://digesto.asamblea.gob.ni/consultas/coleccion/']

    # Collection names mapped to the "cole" ids the site expects;
    # uncomment the other entries to fetch those collections as well.
    papers = {
    #    "Diario de Circulación Nacional": "176",
        "Diario Oficial": "28",
    #    "Obra Bibliográfica": "31",
    #    "Otro": "177",
    #    "Texto de Instrumentos Internacionales": "103"
    }

    def parse(self, response):
        # The result table is filled in by an AJAX POST to proxy.php,
        # so replay that request for every selected collection.
        for key, value in list(self.papers.items()):
            yield FormRequest(
                url='http://digesto.asamblea.gob.ni/consultas/util/ws/proxy.php',
                headers={'X-Requested-With': 'XMLHttpRequest'},
                formdata={
                    'hddQueryType': 'initgetRdds',
                    'cole': value
                },
                meta={'paper': key},
                callback=self.parse_rdds
            )

    def parse_rdds(self, response):
        # proxy.php answers with JSON; every "rdd" entry describes one issue.
        data = json.loads(response.body_as_unicode())
        for r in data["rdds"]:
            r['paper'] = response.meta['paper']
            rddid = r['rddid']
            # pdf.php serves the PDF belonging to a given rdd id
            yield Request("http://digesto.asamblea.gob.ni/consultas/util/pdf.php?type=rdd&rdd=" + rddid,
                          callback=self.download_pdf, meta=r)

    def download_pdf(self, response):
        # Save as <paper>/<year>/<title>-<publication date>.pdf
        filename = "{paper}/{anio}/".format(**response.meta) + \
                   "{titulo}-{fecPublica}.pdf".format(**response.meta).replace("/", "_")
        if not os.path.exists(os.path.dirname(filename)):
            try:
                os.makedirs(os.path.dirname(filename))
            except OSError as exc:  # guard against race condition
                if exc.errno != errno.EEXIST:
                    raise

        with open(filename, 'wb') as f:
            f.write(response.body)

My laptop is out for repair, and on the spare Windows laptop I am not able to install Scrapy with Python 3, but I am pretty sure this should do the job.
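If you do not want to set up a full Scrapy project, you should be able to start the spider from a plain Python script as well, roughly like this (assuming the code above is saved as asamblea_spider.py; the equivalent one-liner would be `scrapy runspider asamblea_spider.py`):

# Minimal runner, assuming the spider above was saved as asamblea_spider.py
from scrapy.crawler import CrawlerProcess

from asamblea_spider import AsambleaSpider

process = CrawlerProcess(settings={"LOG_LEVEL": "INFO"})
process.crawl(AsambleaSpider)
process.start()  # blocks until the crawl is finished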

  • Hi Tarun Lalwani, thank you for your quick reply. The script runs through, but without downloading any `Gaceta` from `Diario Oficial`, see `[scrapy.extensions.logstats] INFO: Crawled 11 pages (at 11 pages/min), scraped 0 items (at 0 items/min)`. By the way, it runs with Python 3. Only line 25 had to be changed to `for key, value in list(self.papers.items()):`. – Til Hund May 05 '18 at 16:56
  • 1
  • Remove the break, I had it in just to process one URL. – Tarun Lalwani May 05 '18 at 16:57