
I know that Java has to be installed for it to run; I did that locally in my IDE and it worked. But I don't know how to install it on AWS Lambda. If anyone could help me with that I would appreciate it.

I think the code itself produces what I am expecting; Java is the missing piece.

This is the error I am getting:

[ERROR] JavaNotFoundError: `java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 30, in lambda_handler
    tables = tabula.read_pdf(io.BytesIO(file_content), pages='all')
  File "/opt/python/tabula/io.py", line 425, in read_pdf
    output = _run(java_options, tabula_options, path, encoding)
  File "/opt/python/tabula/io.py", line 99, in _run
    raise JavaNotFoundError(JAVA_NOT_FOUND_ERROR)

import json
import boto3
import pandas as pd
import io
import re
import tabula
import numpy as np

def f_remove_accents(old):
    """
    Removes common accent characters, lower form.
    Uses: regex.
    """
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    new = re.sub(r'[ç]', 'c', new)
    return new

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    if event:
        s3_records = event["Records"][0]
        bucket_name = str(s3_records["s3"]["bucket"]["name"])
        file_name = str(s3_records["s3"]["object"]["key"])
        file_obj = s3.get_object(Bucket=bucket_name, Key=file_name)
        file_content = file_obj["Body"].read()

        tables = tabula.read_pdf(io.BytesIO(file_content), pages='all')

        # Create an empty list to store all the modified tables
        modified_tables = []

        # Apply functions to the content of each table
        for table in tables:
            # Convert the DataFrame to a NumPy array
            table_array = table.values.astype(str)

            # Remove accents
            remove_accents_func = np.vectorize(f_remove_accents)
            table_array = remove_accents_func(table_array)

            # Replace ';' with ' '
            table_array = np.char.replace(table_array, ';', ' ')

            # Convert to upper case
            table_array = np.char.upper(table_array)

            # Create a new DataFrame with the modified array
            modified_table = pd.DataFrame(table_array, columns=table.columns)

            # Append the modified table to the list
            modified_tables.append(modified_table)

        # Concatenate all the modified tables into a single DataFrame
        final_df = pd.concat(modified_tables, ignore_index=True)

        # Save the final DataFrame as a CSV file
        name_of_return_file = f'{file_name[:-4]}_return.csv'
        final_df.to_csv(name_of_return_file, sep=';', index=False)

        # Read the CSV file content
        with open(name_of_return_file, 'rb') as file:
            csv_content = file.read()

        # Upload the CSV file to the destination bucket
        s3.put_object(Body=csv_content, Bucket='bucket-recebendo', Key=name_of_return_file)

1 Answer


So this is how I managed to get tabula up and running on Lambda. I used the container image (OCI) support in Lambda to package up the requirements.

First of all, I used VSCode to create the folder structure to hold my Lambda function and Dockerfile.

I created a file called "lambda_function.py" that had the code you listed above, together with a pip requirements.txt file containing the following libraries (don't use this as-is; use pinned versions, I am only doing it this way for speed):

pandas
numpy
tabula-py
boto3

I did have to change your code, and I made the following changes to the import statements, as I was getting "AttributeError: module 'tabula' has no attribute 'read_pdf'":

import json
import boto3
import pandas as pd
import io
import re
from tabula.io import read_pdf
import numpy as np
..
..
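To illustrate the pinned-versions point, a pinned requirements.txt might look something like the following; the version numbers here are only examples I picked for illustration, so pin whatever versions you have actually tested against:

pandas==2.2.2
numpy==1.26.4
tabula-py==2.9.0
boto3==1.34.0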

Then I created my Dockerfile (very rough, not optimised, just enough to bootstrap this and get it going; you definitely want to improve this):

FROM public.ecr.aws/lambda/python:3.10
COPY requirements.txt ${LAMBDA_TASK_ROOT}
COPY lambda_function.py ${LAMBDA_TASK_ROOT}
RUN yum install java-17-amazon-corretto-devel -y
RUN pip install --upgrade pip
RUN pip install --no-cache-dir -r requirements.txt
CMD [ "lambda_function.lambda_handler" ]

This is what my layout looked like:

├── Dockerfile
├── lambda_function.py
└── requirements.txt

I then ran the following commands to build my container image locally and push it to Amazon ECR (I had already created an ECR repository):

docker build -t lambda-tabula:1.0.0 . 
aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin xxxx.dkr.ecr.eu-west-1.amazonaws.com
docker tag lambda-tabula:1.0.0 xxxx.dkr.ecr.eu-west-1.amazonaws.com/lambda-oci-demo:1.0.1
docker push xxxx.dkr.ecr.eu-west-1.amazonaws.com/lambda-oci-demo:1.0.1
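The answer assumes the ECR repository already exists; if you still need to create one first, it is a single command along these lines (the repository name is only assumed to match the tag used above):

aws ecr create-repository --repository-name lambda-oci-demo --region eu-west-1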

I now had my image in ECR, which I could reference via its image URI ("xxxx.dkr.ecr.eu-west-1.amazonaws.com/lambda-oci-demo:1.0.1").

I then created a new Lambda function, choosing the container image option and pointing it at this image. I also created a role for the function that provides permissions to the specific S3 buckets involved.
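If you would rather script this than use the console, creating the function from the image looks roughly like the sketch below; the function name, role ARN, timeout and memory values are placeholders I made up, as the answer does not specify them:

aws lambda create-function \
  --function-name tabula-pdf-to-csv \
  --package-type Image \
  --code ImageUri=xxxx.dkr.ecr.eu-west-1.amazonaws.com/lambda-oci-demo:1.0.1 \
  --role arn:aws:iam::xxxx:role/lambda-tabula-role \
  --timeout 120 \
  --memory-size 1024 \
  --region eu-west-1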

You can find more details on container image support for Lambda in the AWS documentation.

This allows the function to run, although it then fails with a different error because the function tries to write to a read-only filesystem. To fix that, I changed name_of_return_file = f'{file_name[:-4]}_return.csv' to name_of_return_file = f'/tmp/{file_name[:-4]}_return.csv', since /tmp is the only writable path inside the Lambda environment.
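One thing to watch with that change (my own observation, not part of the original answer): if name_of_return_file is also used as the S3 Key, as in the code below, the uploaded object ends up under a tmp/ prefix in the destination bucket. A hypothetical helper that keeps the local path and the S3 key separate could look like this:

import os

def build_output_paths(file_name):
    # Strip any directory part from the incoming key, e.g. "incoming/invoice.pdf" -> "invoice.pdf"
    base = os.path.basename(file_name)
    # "invoice.pdf" -> "invoice_return.csv"
    csv_name = f"{os.path.splitext(base)[0]}_return.csv"
    # /tmp is the only writable location inside the Lambda runtime
    local_path = os.path.join("/tmp", csv_name)
    # Write the CSV to local_path, then upload it with Key=csv_name
    return local_path, csv_name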

Once I fixed that, it worked great. Here is the updated code

import json
import boto3
import pandas as pd
import io
import re
from tabula.io import read_pdf
import numpy as np

def f_remove_accents(old):
    new = old.lower()
    new = re.sub(r'[àáâãäå]', 'a', new)
    new = re.sub(r'[èéêë]', 'e', new)
    new = re.sub(r'[ìíîï]', 'i', new)
    new = re.sub(r'[òóôõö]', 'o', new)
    new = re.sub(r'[ùúûü]', 'u', new)
    new = re.sub(r'[ç]', 'c', new)
    return new

def lambda_handler(event, context):
    s3 = boto3.client("s3")
    if event:
        #s3_records = event["Records"][0]
        #bucket_name = str(s3_records["s3"]["bucket"]["name"])
        #file_name = str(s3_records["s3"]["object"]["key"])
        #file_obj = s3.get_object(Bucket=bucket_name, Key=file_name)
        file_obj = s3.get_object(Bucket="tabula-demo", Key="invoice.pdf")
        file_content = file_obj["Body"].read()

        tables = read_pdf(io.BytesIO(file_content), pages='all')

        # Create an empty list to store all the modified tables
        modified_tables = []

        # Apply functions to the content of each table
        for table in tables:
            # Convert the DataFrame to a NumPy array
            table_array = table.values.astype(str)

            # Remove accents
            remove_accents_func = np.vectorize(f_remove_accents)
            table_array = remove_accents_func(table_array)

            # Replace ';' with ' '
            table_array = np.char.replace(table_array, ';', ' ')

            # Convert to upper case
            table_array = np.char.upper(table_array)

            # Create a new DataFrame with the modified array
            modified_table = pd.DataFrame(table_array, columns=table.columns)

            # Append the modified table to the list
            modified_tables.append(modified_table)

        # Concatenate all the modified tables into a single DataFrame
        final_df = pd.concat(modified_tables, ignore_index=True)

        # Save the final DataFrame as a CSV file (/tmp is the only writable path in Lambda)
        #name_of_return_file = f'{file_name[:-4]}_return.csv'
        name_of_return_file = '/tmp/test_return.csv'
        final_df.to_csv(name_of_return_file, sep=';', index=False)

        # Read the CSV file content
        with open(name_of_return_file, 'rb') as file:
            csv_content = file.read()

        # Upload the CSV file to the destination bucket
        s3.put_object(Body=csv_content, Bucket='094459-lambda-libs', Key=name_of_return_file)

Note! I had to modify the code above to hard-code an input file, as I didn't know what your input files were.