I am working on a project that extracts information from PDF files using Azure Form Recognizer. Text extraction works, but table extraction does not: every page has a border, so the entire page is treated as a single table.
To overcome this issue, I am attempting to remove the borders from the PDF before sending it to Azure Form Recognizer. I am using the pdfplumber library in Python to extract the rectangles (borders), but I am unable to find a way to modify the PDF and remove these borders.
I have attached an image of a PDF page along with the code snippet I am currently using. Additionally, I have included an image showing the rectangles with the maximum height that I have extracted.
I would greatly appreciate any assistance or suggestions on how to remove the borders from the PDF using Python, or any alternative ideas to achieve the desired outcome.
Code Snippet:
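import pdfplumber
import pandas as pd

# Open the PDF; the context manager closes the file afterwards
with pdfplumber.open('file.pdf') as reader:
    pag = reader.pages[5]  # pages is zero-indexed, so this is the sixth page
    # Each rectangle contributes four edges; collect them into a DataFrame
    df1 = pd.DataFrame(pag.rect_edges)

# Keep only edges with non-zero height (the vertical edges)
df1 = df1[df1['height'] != 0.0]
print(df1)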
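
One alternative I am considering, instead of editing the PDF itself, is to identify which rectangles are page borders and work with the area just inside them. Below is a minimal sketch of that idea, not a tested solution. The function names find_border_rects and inner_bbox, the 0.9 coverage threshold, the 1.0 margin, and the sample data are all assumptions made for illustration. The dictionaries are shaped like the entries pdfplumber returns in page.rects (keys x0, top, x1, bottom, width, height).

```python
def find_border_rects(rects, page_width, page_height, min_cover=0.9):
    """Return rects whose width and height each span at least
    min_cover of the page (assumed heuristic for a page border)."""
    borders = []
    for r in rects:
        wide = r["width"] >= min_cover * page_width
        tall = r["height"] >= min_cover * page_height
        if wide and tall:
            borders.append(r)
    return borders


def inner_bbox(border, margin=1.0):
    """Bounding box just inside a border rect, as (x0, top, x1, bottom)."""
    return (
        border["x0"] + margin,
        border["top"] + margin,
        border["x1"] - margin,
        border["bottom"] - margin,
    )


# Hypothetical sample data shaped like pdfplumber rect dicts
sample = [
    {"x0": 20, "top": 20, "x1": 592, "bottom": 772, "width": 572, "height": 752},
    {"x0": 50, "top": 100, "x1": 300, "bottom": 200, "width": 250, "height": 100},
]

borders = find_border_rects(sample, page_width=612, page_height=792)
print(borders)
print(inner_bbox(borders[0]))
```

With a real file, the inputs would presumably be pag.rects, pag.width and pag.height, and the resulting box could be passed to pag.crop(...) to look at only the content inside the border. That still does not write a new PDF without the border, which is the part I am missing.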