Web scraping pdf using URL Python 3 TypeError

Question

I am trying to write code that downloads a PDF from a URL. I found a method for doing this, but it was not written for Python 3 and used the file() function.

I tried replacing this with open() in the line fp = open(path, 'rb').

However, I get this error:

TypeError: expected str, bytes or os.PathLike object, not HTTPResponse.

I can't find a solution online. Any help would be appreciated. Here is the code:

import bs4 as bs
import urllib.request
from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.pdfpage import PDFPage
from pdfminer.layout import LAParams
from io import StringIO
from io import open

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    stri = retstr.getvalue()
    retstr.close()
    return stri

pdfFile = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf");
outputString = convert_pdf_to_txt(pdfFile)

print(outputString)
pdfFile.close()

Resources used

http://zempirians.com/ebooks/Ryan%20Mitchell-Web%20Scraping%20with%20Python_%20Collecting%20Data%20from%20the%20Modern%20Web-O'Reilly%20Media%20(2015).pdf (page 101)

Extracting text from a PDF file using PDFMiner in python? (the top answer)

If you reference an outside resource in your question, especially one after which your code is closely modeled, it would be helpful to all parties if you linked to that resource. — ndmeiri, Feb 18 '18 at 04:35
Also, please fix the indentation in your posted code. You should always check for correct indentation before posting your question. — ndmeiri, Feb 18 '18 at 04:37
Also, post the full stack trace that you see when the `TypeError` is raised. — ndmeiri, Feb 18 '18 at 04:40

adrtam · Answer 1 · 2018-02-18T17:41:10.493

Do this (you need to get the bytes from an HTTP response object):

pdfResponse = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf")
outputString = convert_pdf_to_txt(pdfResponse.read())

See https://docs.python.org/3/library/http.client.html#httpresponse-objects
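As a quick illustration of why read() is the missing step, here is a minimal sketch. It uses a data: URL as a stand-in (my assumption, not the real address) so that it runs without network access. With the real PDF URL, urlopen returns an HTTPResponse instead, but read() should behave the same way and hand back bytes.

```python
import base64
from urllib.request import urlopen

# Stand-in for the real PDF address: a data: URL carrying only a PDF header line.
# This is an assumption for illustration; it is not a complete PDF document.
payload = base64.b64encode(b"%PDF-1.4\n").decode("ascii")
url = "data:application/pdf;base64," + payload

with urlopen(url) as response:
    # The response is a file-like object, not a path, so it cannot be given to open().
    raw = response.read()  # the entire body as a bytes object

print(type(raw).__name__)  # bytes
print(raw[:5])             # b'%PDF-'
```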

Note that you then also have to modify your convert_pdf_to_txt function so that it takes raw bytes as input instead of a path, i.e., instead of

def convert_pdf_to_txt(path):
   fp = open(path, 'rb')
   ...
   for page in PDFPage.get_pages(fp, ...)

You have to do:

def convert_pdf_to_txt(rawbytes):
    import io
    fp = io.BytesIO(rawbytes)
    ...
    for page in PDFPage.get_pages(fp, ...)

io.BytesIO wraps bytes data in a file-like binary stream (https://docs.python.org/3/library/io.html#binary-i-o), so afterwards you can treat it as if it were an opened file.
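To make that concrete, here is a small sketch of io.BytesIO standing in for an opened file. The byte string is made-up sample data, not a real PDF.

```python
import io

# Made-up sample data standing in for the bytes returned by response.read().
rawbytes = b"%PDF-1.4\nrest of the document"

fp = io.BytesIO(rawbytes)   # behaves like a file opened with mode 'rb'
header = fp.read(8)         # read the first eight bytes
position = fp.tell()        # the stream position advances as with a real file
fp.seek(0)                  # rewind; PDF parsers typically need to seek around
everything = fp.read()      # read the whole stream from the start
fp.close()

print(header)    # b'%PDF-1.4'
print(position)  # 8
```

Because the object supports read(), seek() and tell(), it can be passed anywhere a binary file object is expected.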

I haven't used this PDF library before, but this should get you started in the right direction.

Hey, thanks heaps for your response. I now get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 10: invalid continuation byte — D.Walsh, Feb 18 '18 at 04:50
Sorry, I overlooked that part. I've edited the answer; hope that helps. — adrtam, Feb 18 '18 at 17:41

chb · Answer 2 · 2018-02-18T08:05:23.610

Rather than struggle with an obsolescent version of pdfminer, I'd advise using pdfminer.six, a more recent fork of the pdfminer library that's compatible with Python 3.

pip install pdfminer.six

You'll have to edit some of the import statements, but for the most part, the newer fork is a drop-in replacement.

So now, after reading the body of the HTTP response (as per Adrian Tam's advice), you've got the PDF's contents as a bytes object. You can then call your conversion method with that object as a parameter:

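from io import BytesIO, StringIO  # in addition to the pdfminer imports from the question

def convert_pdf_to_txt(pdf_obj):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    fp = BytesIO(pdf_obj)  # wrap the raw PDF bytes in a file-like binary object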
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()
    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    device.close()
    stri = retstr.getvalue()
    retstr.close()
    print(stri)
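If your pdfminer.six release is recent enough to include pdfminer.high_level.extract_text, the whole conversion can probably be shortened. The sketch below assumes that function is available in your install. The helper name pdf_bytes_to_text and the "%PDF-" header check are my own additions, there so that an HTML error page returned by the server fails early with a clear message instead of somewhere inside the parser.

```python
from io import BytesIO


def pdf_bytes_to_text(pdf_bytes):
    """Return the text of a PDF given as raw bytes (hypothetical helper)."""
    # Simplified sanity check (an assumption of this sketch): most PDFs begin
    # with the "%PDF-" marker, whereas an HTML error page does not.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("downloaded data does not look like a PDF")

    # Imported here so that a missing pdfminer.six install only matters
    # once a conversion is actually attempted.
    from pdfminer.high_level import extract_text

    # extract_text accepts a file-like binary object as well as a path.
    return extract_text(BytesIO(pdf_bytes))
```

You would call it with the bytes read from the response, for example pdf_bytes_to_text(urlopen(url).read()), and it returns the text rather than printing it.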
