
**I tried to run this Scrapy spider to download all the related PDFs from a given URL**

I tried to execute it using `scrapy crawl mySpider`:

```python
import urlparse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "sec_gov"

    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://secsearch.sec.gov/search?utf8=%3F&affiliate=secsearch&query=exhibit+10"]

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```

Can anyone help me with this? Thanks in advance.

Vinod kumar
  • Please check if the *scrapy.cfg* file exists in the same path from where you are running the scrapy crawl command – nilansh bansal Oct 25 '18 at 12:33
  • The question looks like a copy of https://stackoverflow.com/questions/36135809/using-scrapy-to-to-find-and-download-pdf-files-from-a-website – nilansh bansal Oct 25 '18 at 12:41
  • Yes, I copied the code from that link. But when I ran the same code, it showed me an error. – Vinod kumar Oct 25 '18 at 12:58
  • Possible duplicate of [Using Scrapy to to find and download pdf files from a website](https://stackoverflow.com/questions/36135809/using-scrapy-to-to-find-and-download-pdf-files-from-a-website) – parik Oct 31 '18 at 11:27

2 Answers


Flaws in the code:

The URL http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html redirects to https://www.pwc.com/us/en/services/tax/library.html.

Also, there is no div with the id all_results, so the selector div#all_results matches nothing in the HTML response returned to the crawler. The loop in the parse method therefore finds no links and yields no requests.

For the `scrapy crawl` command to work, you must run it from a directory where the configuration file *scrapy.cfg* exists.

Edit: I hope this code helps. It downloads all the PDFs from the given link.

Code:

```python
import scrapy
from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]

    def parse(self, response):
        base_url = 'https://www.pwc.com'

        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()

            if link.endswith('.pdf'):
                # The hrefs on this page are root-relative, so build the
                # absolute URL by prepending the base URL (no urllib needed).
                link = base_url + link
                self.logger.info(link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        # Write the response body to a file named after the last URL segment.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```

The code repository can be found at: https://github.com/NilanshBansal/File_download_Scrapy
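A note on the `base_url + link` concatenation: it assumes every matched href is root-relative. Python 3's `urllib.parse.urljoin` handles both root-relative and absolute hrefs, as this small sketch shows (the paths are made up):

```python
from urllib.parse import urljoin

base_url = 'https://www.pwc.com'

# A root-relative href, as found on the PwC pages, is resolved against the base:
print(urljoin(base_url, '/us/en/report.pdf'))          # https://www.pwc.com/us/en/report.pdf

# An absolute href is passed through unchanged:
print(urljoin(base_url, 'https://example.com/a.pdf'))  # https://example.com/a.pdf
```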

nilansh bansal
  • Thanks for your answer, Nilansh. In the folder only the myspider.py file is there; no other file is available. Can you guide me with this, please? – Vinod kumar Oct 25 '18 at 13:00
  • Thanks for the suggested links. I checked them, but I can see they retrieve data from that site (Zomato). My goal is to download all the PDF files at a given URL. Can you please suggest a good document for that? – Vinod kumar Oct 29 '18 at 05:43
  • Thank you very much for this code. I tried to run it and got this error: `surukam@surukam-Lenovo:~/scrapy/democheck/a$ scrapy crawl myspider.py` → `Scrapy 1.5.1 - no active project / Unknown command: crawl / Use "scrapy" to see available commands`. I saved the file as myspider.py in /scrapy/democheck/a and tried to run it using `scrapy crawl myspider.py`. – Vinod kumar Oct 29 '18 at 12:13
  • Run the command **scrapy crawl pwc_tax**, since the name of the spider is pwc_tax. – nilansh bansal Oct 29 '18 at 12:27
  • Yes bro, I tried that command too, but got the same error: `surukam@surukam-Lenovo:~/scrapy/democheck/a$ scrapy crawl pwc_tax` → `Scrapy 1.5.1 - no active project / Unknown command: crawl / Use "scrapy" to see available commands` – Vinod kumar Oct 29 '18 at 13:06
  • How have you generated the scrapy project? I referred you to the scrapy docs so that you can remove those configuration errors by yourself. Please remember Stack Overflow can only help you; it's not a coding service. We have given you links, please refer to them. I think I have answered your question; if you are facing configuration issues, please post another question or edit this one, since the scope of this question doesn't cover those errors. If I was able to resolve your query to some extent, kindly accept the answer. Thanks! – nilansh bansal Oct 29 '18 at 13:13
  • Thanks for your suggestions, Nilansh. I read some documents where the spider was run without creating a project. Now I have created the scrapy project and executed **scrapy crawl pwc_tax**. The program ran, but I can't see any PDF file downloaded in this directory, and I checked the Downloads folder too. Do I need to add something to download the PDFs, or are they saved somewhere else? – Vinod kumar Oct 29 '18 at 13:36
  • Do I have to add something in items.py, pipelines.py, and settings.py? – Vinod kumar Oct 29 '18 at 14:16
  • No, you don't have to make any changes in items.py and settings.py. I have added the code repository; it's working fine and downloading PDFs. – nilansh bansal Oct 29 '18 at 14:33
  • Thank you, Nilansh. I downloaded the repository and tried to execute it, but I am getting this error and I don't see any document downloaded: `surukam@surukam-Lenovo:~/scrapy/democheck/downloading_files$ scrapy genspider pwc_tax www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html` → `Spider 'pwc_tax' already exists in module: downloading_files.spiders.pwc_tax`. What should I change? – Vinod kumar Oct 30 '18 at 06:01
  • Hello Nilansh, let me describe how I ran this program; if there is a mismatch, please point it out. 1. I downloaded the repository and extracted it (inside I can see File_download_scrapy_master and, within it, the Downloading_files folder and the scrapy.cfg file). 2. I navigated to that folder (cd File_download_scrapy_master/Downloading_files). 3. I executed `scrapy crawl pwc_tax`. 4. The code executed successfully, but I don't see any PDF in that directory. Please help me with this. – Vinod kumar Oct 30 '18 at 09:11
  • please upload a screenshot of your terminal output after you run **scrapy crawl pwc_tax** – nilansh bansal Oct 30 '18 at 09:38
  • Sure, I'll run it once again and upload it right now. – Vinod kumar Oct 30 '18 at 09:42
  • I wasn't able to upload the screenshots directly; they have been converted to URLs. Here are my screenshots, line by line: 1) https://i.stack.imgur.com/gJzA9.png 2) https://i.stack.imgur.com/Y4Obx.png 3) https://i.stack.imgur.com/1CHaP.png – Vinod kumar Oct 30 '18 at 09:57
  • Thanks for checking my screenshots. I made the change and tried to execute again, and I got this error: 1) https://i.stack.imgur.com/HTYaj.png – Vinod kumar Oct 30 '18 at 10:14
  • We are not here to resolve your errors; learn to do a Google search yourself. I could tell you the answer, but I think you are too new to the Python language as well. Please learn to resolve errors by yourself; this one is due to your system configuration, and I can't know what configuration your system is using. Anyway, you can refer to this: https://stackoverflow.com/questions/29358403/no-module-named-urllib-parse-how-should-i-install-it – nilansh bansal Oct 30 '18 at 10:18
  • Thanks for your suggestions. Yes, I am very new to Python and to Scrapy as well; I'll learn. My task is to download a specific PDF from a URL that contains a lot of PDFs. If you know a good document for this, please suggest it. Once again, thank you very much, bro. – Vinod kumar Oct 30 '18 at 10:27
  • I have made changes to the above code; please check and apply them in your code. It should work now. – nilansh bansal Oct 30 '18 at 10:28
  • Thank you very much, bro, it's working. You saved my job. I have to download PDFs from another website; I'll try it on my task site. Really, thanks a lot. – Vinod kumar Oct 30 '18 at 10:45
  • I have one last doubt, or rather need a suggestion from you: for downloading a PDF from another website, do I just change the URL, or do I need to change something else, i.e. the XPath? – Vinod kumar Oct 30 '18 at 10:46
  • I have already voted for your great answer. Thank you for your kindness and patience in answering my questions and clearing my doubts. – Vinod kumar Oct 30 '18 at 10:53

You should run the command inside the directory where *scrapy.cfg* is present.

HariUserX
  • Thanks for your answer. I have only this file in that folder; there are no other documents or files inside it. Can you suggest how to run this code? – Vinod kumar Oct 29 '18 at 12:21