How to remove borders from a PDF using Python and pdfplumber for Azure Form Recognizer?

Question

I am working on a project that extracts information from PDF files using Azure Form Recognizer. Text extraction works, but table extraction does not: each page has a border drawn around it, so the entire page is treated as a single table.

To overcome this issue, I am attempting to remove the borders from the PDF before sending it to Azure Form Recognizer. I am using the pdfplumber library in Python to extract the rectangles (borders), but I am unable to find a way to modify the PDF and remove these borders.

I have attached an image of a PDF page along with the code snippet I am currently using. Additionally, I have included an image showing the rectangles with the maximum height that I have extracted.

I would greatly appreciate any assistance or suggestions on how to remove the borders from the PDF using Python, or any alternative ideas to achieve the desired outcome.

Code Snippet:

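import pandas as pd
import pdfplumber

# Open the PDF and select the page to inspect (index 5 is the sixth page).
with pdfplumber.open('file.pdf') as pdf:
    page = pdf.pages[5]
    # Collect the edges of every rectangle drawn on the page.
    edges = pd.DataFrame(page.rect_edges)

# Keep only edges with a non-zero height, i.e. the vertical ones.
vertical_edges = edges[edges['height'] != 0.0]
print(vertical_edges)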
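
One possible way to narrow the extracted rectangles down to the page border itself is to keep only those that span almost the whole page. The sketch below is only an illustration: the function name `find_border_rects`, the `min_coverage` threshold of 0.9, and the sample coordinates are all assumptions, and it expects dicts with the `x0`, `x1`, `top`, and `bottom` keys that pdfplumber uses for its rectangle objects.

```python
def find_border_rects(rects, page_width, page_height, min_coverage=0.9):
    """Return the rects whose width and height both span at least
    min_coverage of the page (hypothetical helper, threshold is a guess)."""
    borders = []
    for rect in rects:
        width = rect["x1"] - rect["x0"]
        height = rect["bottom"] - rect["top"]
        if width >= min_coverage * page_width and height >= min_coverage * page_height:
            borders.append(rect)
    return borders


# Made-up sample data shaped like pdfplumber rectangle dicts (A4 page in points).
sample_rects = [
    {"x0": 20, "x1": 575, "top": 20, "bottom": 822},   # spans nearly the whole page
    {"x0": 50, "x1": 300, "top": 100, "bottom": 250},  # an ordinary table cell
]

border_rects = find_border_rects(sample_rects, page_width=595, page_height=842)
print(border_rects)
```

With pdfplumber, the inputs would presumably come from `page.rects`, `page.width`, and `page.height`. This only identifies the border; pdfplumber reads PDFs but does not write them, so actually deleting the matched rectangles would need a separate PDF-editing library.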

PDF Page Image:

Rectangles Coordinates Image:
