
I tried using Scrapy with the PyPDF2 library to crawl PDFs online, without success. So far I can navigate all the links and fetch the PDF files, but feeding them to PyPDF2 fails.

Note: my goal is not to download or save the PDF files. I want to parse them by first converting each PDF to text and then processing that text with other methods.

For brevity, I did not include the entire code here. Here's part of my code:

import io
import re
import PyPDF2
import scrapy
from scrapy.item import Item

class ArticleSpider(scrapy.Spider):
    name = "spyder_ARTICLE"                                                 
    start_urls = ['https://legion-216909.appspot.com/content.htm']                                                                      

    def parse(self, response):                                              
        for article_url in response.xpath('//div//a/@href').extract():      
            yield response.follow(article_url, callback=self.parse_pdf) 

    def parse_pdf(self, response):
        """ Peek inside PDF to check for targets.
        @return: PDF content as searchable plain-text string
        """
        reader = PyPDF2.PdfFileReader(response.body)
        text = u""

        # Title is optional, may be None
        if reader.getDocumentInfo().title: text += reader.getDocumentInfo().title
        # XXX: Does this handle unicode properly?
        for page in reader.pages: text += page.extractText()

        return text

Each time I run the code, the spider reaches `reader = PyPDF2.PdfFileReader(response.body)` and raises the following error: `AttributeError: 'bytes' object has no attribute 'seek'`

What am I doing wrong?

Code Monkey
  • Reread [docs.scrapy.org/en/latest/topics/spiders.html](https://docs.scrapy.org/en/latest/topics/spiders.html). You didn't understand the **callback function**. – stovfl Sep 26 '18 at 10:45
  • "3. In callback functions, you parse the page contents, typically using Selectors (but you can also use BeautifulSoup, lxml or whatever mechanism you prefer) and generate items with the parsed data." – Code Monkey Sep 26 '18 at 11:16
  • Do a `print(response.body)` and see what you get. – stovfl Sep 26 '18 at 11:19
  • Possible duplicate of [Using Scrapy to to find and download pdf files from a website](https://stackoverflow.com/questions/36135809/using-scrapy-to-to-find-and-download-pdf-files-from-a-website) – stovfl Sep 26 '18 at 11:23
  • Using all caps is considered as shouting on the internet. Kindly use other methods like bold or italics to highlight a portion. – Anuvrat Parashar Sep 26 '18 at 12:06

1 Answer


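That does not seem to be a problem with Scrapy. PyPDF2 expects a seekable binary stream (a file-like object), whereas `response.body` is a plain `bytes` object, which is why the error complains about a missing `seek` attribute.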

# use this instead of passing response.body directly into PyPDF2
reader = PyPDF2.PdfFileReader(io.BytesIO(response.body))
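
To see why the wrapper matters, here is a minimal sketch that uses only the standard library. The `body` value is a made-up stand-in for `response.body` (it is not a valid PDF), and the variable names are just for illustration. It shows that raw bytes have no `seek` method, while the same bytes wrapped in `io.BytesIO` behave like an open binary file:

```python
import io

# Hypothetical stand-in for response.body: Scrapy hands the callback raw bytes.
# These bytes are illustrative only and are not a valid PDF document.
body = b"%PDF-1.4 illustrative bytes, not a real document"

# A bytes object has no file API, which is what the AttributeError reports.
bytes_has_seek = hasattr(body, "seek")

# io.BytesIO wraps the same bytes in an in-memory, seekable binary stream.
stream = io.BytesIO(body)
stream_has_seek = hasattr(stream, "seek")

# A PDF parser needs random access (for example, to reach the end of the file),
# so seeking to the end and back must work on whatever object it is given.
end_offset = stream.seek(0, io.SEEK_END)
stream.seek(0)
header = stream.read(5)

print(bytes_has_seek, stream_has_seek, end_offset == len(body), header)
```

Nothing is written to disk here: the stream lives in memory, which fits the goal of parsing the PDFs without saving them.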

Hope this helps.

Anuvrat Parashar