I am retrieving data from an API that returns a JSON object with the following structure:

{ 
  "status":"OK", 
  "text":{ 
    "doc_id":647508, 
    "bill_id":502329, 
    "date":"2012-05-23", 
    "type":"Enrolled", 
    "mime":"application/rtf", 
    "doc":"MIME 64 Encoded Document"
  } 
} 

where the encoded document is a PDF file. Here is an example of the PDFs I am working with: https://legiscan.com/WA/text/HB1531/id/1473804/Washington-2017-HB1531-Introduced.pdf. I am trying to read the encoded file into a string object. So far I have managed this by decoding the response into bytes, writing them to a file, and then reading the PDF back from disk:

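import base64

import PyPDF2

# Decode the Base64 payload and write the resulting bytes to disk.
with open("sample.pdf", "wb") as f:
    inp_str = response.json()['text']['doc'].encode('utf-8')
    f.write(base64.b64decode(inp_str))

# Reopen the file in binary mode and parse it with PyPDF2.
with open("sample.pdf", "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)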

Writing every document to disk and reading it back does not seem like an efficient way to process multiple documents. I have tried following a related question (Is it possible to input pdf bytes straight into PyPDF2 instead of making a PDF file first):

read_pdf = PyPDF2.PdfFileReader(io.BytesIO(response.json()['text']['doc'].encode()))

but I always get the error PdfReadError: Could not read malformed PDF file

Is there any way to read the PDF directly from the response, without writing it to a file first?

  • In your attempt using `BytesIO`, you forgot to call `base64.b64decode`! – hobbs Jul 27 '21 at 07:18
  • @hobbs I switched to `read_pdf = PyPDF2.PdfFileReader(io.BytesIO(base64.b64decode(response.json()['text']['doc'])))` but when I try to read the pages I get something like `{'/Type': '/Page', '/Parent': {'/Type': '/Pages', '/Count': 96, '/Kids': [IndirectObject(3, 0), IndirectObject(18, 0), IndirectObject(20, 0), ....` – ZMV Jul 27 '21 at 08:20
  • @hobbs Let me add that this does not happen when I create the PDF files and then read them. – ZMV Jul 27 '21 at 08:30
  • "MIME 64" is not really correct terminology; apparently they mean Base64. – tripleee Jul 27 '21 at 08:38
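
The decode step raised in the comments can be sketched with the standard library alone. This is a minimal illustration rather than a confirmed fix: the helper name `pdf_stream_from_payload` and the stand-in payload are hypothetical, and the `%PDF` header check assumes the decoded document really is a PDF that starts with that marker.

```python
import base64
import io


def pdf_stream_from_payload(payload: dict) -> io.BytesIO:
    """Decode the Base64 'doc' field and return it as an in-memory binary stream.

    Hypothetical helper: the name and the header check are illustrative assumptions.
    """
    encoded = payload["text"]["doc"]
    raw = base64.b64decode(encoded)  # accepts an ASCII str or bytes
    if not raw.startswith(b"%PDF"):
        raise ValueError("decoded document does not look like a PDF")
    return io.BytesIO(raw)


# Stand-in payload for illustration only; a real one would come from response.json().
fake_pdf = b"%PDF-1.4\n% placeholder bytes, not a complete PDF\n"
payload = {"status": "OK", "text": {"doc": base64.b64encode(fake_pdf).decode("ascii")}}

stream = pdf_stream_from_payload(payload)
print(stream.read(8))  # b'%PDF-1.4'
stream.seek(0)         # rewind before handing the stream to a parser

# Skipping the decode step hands a parser Base64 text instead of PDF bytes:
print(payload["text"]["doc"].encode()[:4])  # b'JVBE'
```

A stream like this could then be passed to `PyPDF2.PdfFileReader` in place of the file object used in the question, which would avoid the round trip through `sample.pdf`. Whether that also resolves the page output shown in the comments is not something this sketch establishes.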
