I want to extract the data inside a rectangular box in a PDF file into a CSV file, with the corresponding rows and columns. I tried Camelot, PyPDF2, Tabula, and similar libraries, but I could not get the desired output in the CSV file. Could anyone help me with this?
The data sits inside a rectangular box in the PDF file. The input file is linked here: [input PDF file][2]
Below is the code I have tried:

    import PyPDF2

    # Uses the legacy PyPDF2 (< 3.0) names: PdfFileReader, numPages, getPage, extractText
    with open('Rectangle_Box_PDF_2021_v2.pdf', 'rb') as pdf_file_obj:
        pdf_read = PyPDF2.PdfFileReader(pdf_file_obj)
        print("The total number of pages : " + str(pdf_read.numPages))

        # Extract the raw text of the first page
        page_obj = pdf_read.getPage(0)
        pdf_list = [page_obj.extractText()]
        print(pdf_list)

    # Split the page text into lines and flatten them into a single list
    list1 = []
    for i in range(0, len(pdf_list)):
        list1.append(pdf_list[i].split('\n'))
    flatList = sum(list1, [])
    print(flatList)

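
The step I am missing is turning the extracted lines into rows and columns. To illustrate the output I am after, here is a minimal sketch using only the standard library. It assumes each extracted line is one row and that cells are separated by two or more spaces; that is only an assumption, and the real text from my PDF may not be laid out that way. The names `lines_to_rows`, `rows_to_csv`, and `sample_lines` are hypothetical, and the sample data is made up to stand in for `flatList`.

```python
import csv
import io
import re


def lines_to_rows(lines):
    """Split each non-empty line into cells on runs of two or more whitespace characters."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module takes care of quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample standing in for flatList; this is not the real PDF content.
sample_lines = ["Name  Quantity  Price", "Widget A  2  10.50", "", "Widget B  1  3.00"]
csv_text = rows_to_csv(lines_to_rows(sample_lines))
print(csv_text)
```

To write an actual file instead of a string, the same rows could be passed to `csv.writer` on a file opened with `open('output.csv', 'w', newline='')`.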
  [2]: https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view?usp=share_link