
I want to write an AWS Lambda function that:

Takes a PDF file from an S3 bucket -> splits it into pages -> stores the split files back to an S3 bucket.
I am using the PyPDF2 module, so I also need to know how to use it inside an AWS Lambda function.

The code to split pdf files:

import os
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'filename.pdf'
file_base_name = pdf_file_path.replace('.pdf','')
output_folder_path = os.path.join(os.getcwd(), 'output')
os.makedirs(output_folder_path, exist_ok=True)

pdf = PdfFileReader(pdf_file_path)

for page_num in range(pdf.numPages):
    pdfWriter = PdfFileWriter()
    pdfWriter.addPage(pdf.getPage(page_num))

    with open(os.path.join(output_folder_path, '{0}_Page{1}.pdf'.format(file_base_name, page_num + 1)), 'wb') as f:
        pdfWriter.write(f)

What should my Lambda function (the code) look like for this?

2 Answers


Your Lambda code needs to look something like this. In this case I am reading an S3 file using boto3; the arguments are passed to your Lambda function in the event.

import boto3
from content_reader_lambda.pdf import reader

def read_pdf_from_bucket(event, context):
    bucket_name = event['bucket_name']
    file_name = event['file_name']
    s3 = boto3.resource('s3')
    obj = s3.Object(bucket_name, file_name)
    s3_file = obj.get()['Body'].read()
    return reader.pdf_as_text(s3_file, 'pdf')
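
If you want to exercise the handler locally before deploying, you could call it with a hand-built event. This is only a sketch: the bucket and key names below are placeholders, not anything defined in this answer.

# Hypothetical local test; replace the placeholder bucket/key with real ones
if __name__ == "__main__":
    event = {"bucket_name": "my-input-bucket", "file_name": "incoming/filename.pdf"}
    print(read_pdf_from_bucket(event, None))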

I am using PyMuPDF to read the PDF and return its text like this.

import fitz  # PyMuPDF

def pdf_as_text(file_stream, filetype):
    text = ''
    with fitz.open(stream=file_stream, filetype=filetype) as doc:
        for page in doc:
            # sort=True reads the text in display/reading order:
            # https://pymupdf.readthedocs.io/en/latest/page.html#Page.get_textpage
            text += page.get_text('text', sort=True)
    return text

You can replace that with your own splitting code and use boto3 to write the resulting PDFs back to S3, as sketched below.
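
For the write-back direction, a minimal sketch could use upload_fileobj. The buffer, bucket, and key here are placeholders, not anything defined in this answer.

import io

import boto3

s3 = boto3.client("s3")

def upload_pdf(buffer: io.BytesIO, bucket: str, key: str) -> None:
    # Rewind the in-memory PDF and stream it to S3
    buffer.seek(0)
    s3.upload_fileobj(buffer, bucket, key, ExtraArgs={"ContentType": "application/pdf"})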

Deploying your Lambda to AWS together with the third-party libraries it uses is a whole different topic. For that I suggest using layers; smaller libraries are a lot easier to deploy given the AWS size limits.


pypdf can work with file streams (docs):

Reading:

from io import BytesIO

from pypdf import PdfReader, PdfWriter

# Prepare example
with open("example.pdf", "rb") as fh:
    bytes_stream = BytesIO(fh.read())

# Read from bytes_stream
reader = PdfReader(bytes_stream)

# Write to bytes_stream
writer = PdfWriter()
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)

Writing:

from io import BytesIO

import boto3
from pypdf import PdfReader, PdfWriter


# raw_bytes_data is the PDF content, e.g. the bytes read from an S3 object
reader = PdfReader(BytesIO(raw_bytes_data))
writer = PdfWriter()

# Add all pages to the writer
for page in reader.pages:
    writer.add_page(page)

# Add a password to the new PDF
writer.encrypt("my-secret-password")

# Save the new PDF to a file
with BytesIO() as bytes_stream:
    writer.write(bytes_stream)
    bytes_stream.seek(0)
    s3 = boto3.client("s3")
    # write_get_object_response is part of S3 Object Lambda (request_route and
    # request_token come from that event); for a regular bucket, use put_object instead
    s3.write_get_object_response(
        Body=bytes_stream, RequestRoute=request_route, RequestToken=request_token
    )
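
To tie this back to the original question, here is a rough sketch of a page-splitting handler. It is only a sketch: the event keys (bucket_name, file_name), the OUTPUT_BUCKET name, and the output key pattern are all assumptions you would adapt to your own setup.

import os
from io import BytesIO

import boto3
from pypdf import PdfReader, PdfWriter

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-output-bucket"  # placeholder: your destination bucket

def lambda_handler(event, context):
    bucket_name = event["bucket_name"]
    file_name = event["file_name"]

    # Download the source PDF into memory
    obj = s3.get_object(Bucket=bucket_name, Key=file_name)
    reader = PdfReader(BytesIO(obj["Body"].read()))

    base_name = os.path.splitext(os.path.basename(file_name))[0]

    # Write every page to its own single-page PDF and upload it
    for page_num, page in enumerate(reader.pages, start=1):
        writer = PdfWriter()
        writer.add_page(page)
        with BytesIO() as out_stream:
            writer.write(out_stream)
            out_stream.seek(0)
            s3.put_object(
                Bucket=OUTPUT_BUCKET,
                Key="output/{0}_Page{1}.pdf".format(base_name, page_num),
                Body=out_stream.getvalue(),
            )

    return {"pages": len(reader.pages)}

This keeps every page in memory; for very large PDFs you may prefer to spill to /tmp, and pypdf itself would go into a Lambda layer as mentioned in the other answer.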