I am retrieving data from an API that returns a JSON object with the following structure:
{
    "status": "OK",
    "text": {
        "doc_id": 647508,
        "bill_id": 502329,
        "date": "2012-05-23",
        "type": "Enrolled",
        "mime": "application/rtf",
        "doc": "MIME 64 Encoded Document"
    }
}
where the encoded document is a PDF file. Here is an example of the PDFs I am working with: https://legiscan.com/WA/text/HB1531/id/1473804/Washington-2017-HB1531-Introduced.pdf. I am trying to read the encoded file into a string object. So far I have been able to do so by decoding the response into bytes, writing them to a file, and then reading the PDF back:
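import base64
import PyPDF2

# `response` is the API response object obtained earlier.
# Decode the base64 payload and write the raw bytes to disk.
with open("sample.pdf", "wb") as f:
    inp_str = response.json()['text']['doc'].encode('utf-8')
    f.write(base64.b64decode(inp_str))

# Re-open the file and hand it to PyPDF2.
with open("sample.pdf", "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)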
This does not feel like an efficient way to process multiple documents, since every file is written to disk and then read back. I tried following a related question (Is it possible to input pdf bytes straight into PyPDF2 instead of making a PDF file first):
read_pdf = PyPDF2.PdfFileReader(io.BytesIO(response.json()['text']['doc'].encode()))
but I always get the error PdfReadError: Could not read malformed PDF file
Is there any way to read the decoded PDF directly from memory, without writing it to a file first?
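One difference between the two snippets above is that the in-memory attempt wraps the still-encoded base64 text in io.BytesIO, whereas the working version calls base64.b64decode first. That may be why the reader reports a malformed file. Below is a minimal sketch of decoding before wrapping, using only the standard library. The helper name doc_to_stream and the stand-in payload are hypothetical, and the payload merely mimics the JSON structure shown earlier; it is not a real PDF.

```python
import base64
import io


def doc_to_stream(payload):
    """Decode the base64 'doc' field into an in-memory binary stream.

    `payload` is assumed to be the parsed JSON dict, i.e. response.json().
    """
    encoded = payload["text"]["doc"]
    raw = base64.b64decode(encoded)  # undo the base64 encoding first
    return io.BytesIO(raw)           # file-like object, no disk I/O


# Hypothetical stand-in for response.json(); the bytes are a placeholder,
# not a valid PDF document.
sample_payload = {
    "status": "OK",
    "text": {"doc": base64.b64encode(b"%PDF-1.4 placeholder").decode("ascii")},
}

stream = doc_to_stream(sample_payload)
print(stream.read(5))  # b'%PDF-'
stream.seek(0)         # rewind before handing the stream to a PDF reader
```

With a real response, the returned stream could presumably be passed to PyPDF2.PdfFileReader in place of the open file handle, as the linked question suggests, though that has not been verified here against this API.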