
I have PDF files and want to extract information from the first page only. My current solution is to:

  1. Use PyPDF2 to read the file from S3 and save only the first page to disk.
  2. Read back the one-page PDF I just saved, convert it to a byte array, and analyse it with AWS Textract.

It works, but I do not like it. Why do I need to save a file and then immediately read the exact same file back? Can I not pass the data along directly, in memory, at runtime?

Here is what I have done that I don't like:

from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
import boto3

def analyse_first_page(bucket_name, file_name):
    s3 = boto3.resource("s3")
    obj = s3.Object(bucket_name, file_name)
    fs = obj.get()['Body'].read()
    pdf = PdfReader(BytesIO(fs), strict=False)
    writer = PdfWriter()
    page = pdf.pages[0]
    writer.add_page(page)
    
    # Here is the part I do not like
    with open("first_page.pdf", "wb") as output:
        writer.write(output)

    with open("first_page.pdf", "rb") as pdf_file:
        encoded_string = bytearray(pdf_file.read())

    # Analyse the text with Textract
    textract = boto3.client('textract')
    response = textract.detect_document_text(Document={"Bytes": encoded_string})

    return response

analyse_first_page(bucket, file_name)

Is there an AWS-native way to do this, or at least a better way than writing a temporary file?

trazoM
  • 2
    You clearly haven't run this code (there is no `DetectDocumentText` method on the textract client) so you should fix the code and run it first. You can provide the file contents in `Bytes` as an alternative to providing an S3 object reference in `S3Object`. – jarmod Jan 18 '23 at 13:18
  • 1
    @jarmod, Sorry, my bad. I have modified with a working version. While making my code work, I found out that my solution exists in https://stackoverflow.com/questions/68985391/writing-a-python-pdfrw-pdfreader-object-to-an-array-of-bytes-filestream Thank you. – trazoM Jan 18 '23 at 13:27

1 Answer


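You can use BytesIO as an in-memory stream, so there is no need to write the page to a file and then read it back.

from base64 import b64encode
from io import BytesIO

with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    bytes_stream.seek(0)
    # With boto3, the raw bytes_stream.getvalue() may be enough; see the comments below.
    encoded_string = b64encode(bytes_stream.getvalue())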
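
For completeness, here is a rough sketch of how the whole function from the question might look with the temporary file removed. It is not tested against real AWS resources. The helper name `writer_to_bytes` is made up for this example, and the sketch assumes, as the comments on this answer suggest, that boto3 does not need the `Bytes` payload to be base64-encoded.

```python
from io import BytesIO


def writer_to_bytes(writer):
    """Serialise any object with a write(stream) method to bytes, in memory.

    Hypothetical helper for this sketch; a PyPDF2 PdfWriter is one such object.
    """
    with BytesIO() as stream:
        writer.write(stream)
        return stream.getvalue()


def analyse_first_page(bucket_name, file_name):
    # Third-party imports are kept local so the helper above works on its own.
    import boto3
    from PyPDF2 import PdfReader, PdfWriter

    # Download the whole PDF from S3 into memory.
    body = boto3.resource("s3").Object(bucket_name, file_name).get()["Body"].read()

    # Copy only the first page into a new in-memory PDF.
    reader = PdfReader(BytesIO(body), strict=False)
    writer = PdfWriter()
    writer.add_page(reader.pages[0])

    # Assumption: raw bytes are passed as-is, without calling b64encode.
    textract = boto3.client("textract")
    return textract.detect_document_text(
        Document={"Bytes": writer_to_bytes(writer)}
    )
```

Calling `analyse_first_page("my-bucket", "my-file.pdf")` (placeholder names) should then return the same kind of response as the original code, without touching the local disk.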

Mehmet Güngören
  • It worked straight away but without the `b64encode()`. Thanks. – trazoM Jan 18 '23 at 13:43
  • 1
    @trazoM This makes sense. `If you're using an AWS SDK to call Amazon Textract, you might not need to base64-encode image bytes that are passed using the Bytes field.` https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.detect_document_text – jellycsc Jan 18 '23 at 16:05