Python - Extract data inside a Rectangle Box from a PDF file to CSV file

Question

I want to extract the data inside a rectangle box in a PDF file to a CSV file with corresponding columns and rows. I tried Camelot, PyPDF2, Tabula and other libraries, but I couldn't get the desired outcome in a CSV file. Could anyone help me here? I want this data written to a CSV file with the respective rows and columns.

Below is the data present inside a rectangle box in the PDF file, and a link to the input PDF file is attached as well:


Below is the code I have tried:

import PyPDF2

# Open the PDF and read the text of the first page.
# PdfReader, .pages and extract_text() replace the deprecated
# PdfFileReader, numPages/getPage and extractText names.
with open('Rectangle_Box_PDF_2021_v2.pdf', 'rb') as pdf_file_obj:
    pdf_read = PyPDF2.PdfReader(pdf_file_obj)
    print("The total number of pages : " + str(len(pdf_read.pages)))
    page_obj = pdf_read.pages[0]
    pdf_list = [page_obj.extract_text()]
print(pdf_list)

# Split the extracted page text into a flat list of lines
flatList = []
for page_text in pdf_list:
    flatList.extend(page_text.split('\n'))
print(flatList)

[2]: The PDF file link: https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view?usp=share_link

The data is text only. It is inside a rectangle box in a PDF file. Link to the input PDF file: https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view — Mech_Saran, Nov 04 '22 at 02:27
If you want to extract the data inside the rectangle by looking for the rectangle itself, you'll have to use image processing. I'd suggest you instead try extracting the data by converting the entire PDF to a CSV or TXT file. — The Singularity, Nov 04 '22 at 02:29
I tried my best but couldn't do it. Could you please help me here? — Mech_Saran, Nov 04 '22 at 02:31
With the PyPDF2 code shown in the question, I managed to extract the data inside the rectangle box into a list. After that, I couldn't figure out how to write this data to a CSV file with the respective rows and columns. — Mech_Saran, Nov 04 '22 at 02:39
Could you help me ? I am desperately trying and waiting for a good solution. — Mech_Saran, Nov 04 '22 at 02:58
https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view?usp=share_link try this link. — Mech_Saran, Nov 04 '22 at 03:14
Could you help me ? I am desperately trying and waiting for a good solution — Mech_Saran, Nov 06 '22 at 19:45

K J · Answer 1 · 2022-11-08T04:17:46.957

This really is such a poor-quality file: the data is too badly built, with errors, so any means of handling it needs human keyboard input.

The QR code was refused by several readers.

upi://pay?cu=INR
&pa=flipkartinternet@hsbc
&pn=MAHENDRA MAURYA
&gstIn=07AHTPM2207K1Z2
&am=0
&tr=OD124716518958108000
&tn=payOD124716518958108000
&invoiceNo=PZT2204190054Y22YD01
&InvoiceDate=2022-04-19T00:55:16+05:30
&invoiceValue=899.000
&transactionMethod=FLIPKART_FINANCE
&gstBrkUp={GST:96.320|CGST:0|SGST:0|IGST:96.320}
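Since the payload above already carries some invoice values (invoiceValue, gstBrkUp), one option is to read those fields directly. A minimal sketch, assuming the payload is available as a single string without line breaks; the function name parse_upi_payload is hypothetical, and the sample string is a shortened copy of the fields shown above. The split is done by hand rather than with a URL query parser so that the "+" in the date's time zone offset is not turned into a space.

```python
def parse_upi_payload(payload):
    """Return the key=value fields of a 'upi://pay?...' string as a dict.

    Assumption: fields are separated by '&' and values are not
    percent-encoded, as in the payload shown above.
    """
    query = payload.split("?", 1)[1]
    fields = {}
    for part in query.split("&"):
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


# Shortened sample built from the fields listed above.
payload = (
    "upi://pay?cu=INR&pa=flipkartinternet@hsbc"
    "&invoiceNo=PZT2204190054Y22YD01"
    "&InvoiceDate=2022-04-19T00:55:16+05:30"
    "&invoiceValue=899.000"
)
fields = parse_upi_payload(payload)
print(fields["invoiceNo"], fields["invoiceValue"])
```

This only recovers the invoice totals, not the line items of the table.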

Here it is exported to XLSX. No slight on Aspose (rubbish in), but it is not a good result for exporting to CSV.

The best you may expect as plain text (with one comma per line) will be:

Product
Description
Qty
Gross 

Amount
Discount
Taxable 

Value
IGST
Total

Sadow 40 Meters CAT 6 Ethernet Cable Lan
Network CAT6 Internet Modem RJ45 Patch Cord
40 m LAN Cable Grey | sadow Grey cat6 40mtr |
IMEI/SrNo: [[]]

HSN: 85177090 | IGST: 12%
1
899.00
-0.00
802.68
96.32
899.00

Shipping and Handling
Charges
1
0.00
0
0.00
0.00
0.00

TOTAL QTY: 1
TOTAL PRICE: 899.00 

All values are in INR

u
or ze
gna ure

A better bet is possibly to export as HTML and then extract from that, or to import into a spreadsheet for CSV export.

Here using Adobe to export the source PDF to XLSX.

However, best of all was to export via xpdf pdftotext and import into Excel to save as CSV.

Product,Description,Qty,Gross,Discount,Taxable,IGST,Total
,,,Amount,,Value,,

Sadow 40 Meters CAT6 Ethernet Cable Lan,,,,,,,
Network CAT6 Internet Modem RJ45 Patch Cord,HSN: 85177090 | IGST: 12%,1,899.00,-0.00,802.68,96.32,899.00
40 m LAN Cable Grey | sadow Grey cat6 40mtr |,,,,,,,
IMEI/SrNo: [[]],,,,,,,
,Shipping and Handling,1,0.00,0,0.00,0.00,0.00
,Charges,,,,,,
TOTAL QTY: 1,,,,,,TOTAL,PRICE: 899.00
,,,,,,,All values are in INR
,,,,,,u orze,gna ure
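The pdftotext route can also be scripted instead of going through Excel. The following is only a sketch under stated assumptions: the pdftotext command-line tool is installed and on PATH, its -layout option is used to keep the columns aligned, and columns in that output are separated by runs of two or more spaces. The function names and the sample text are made up for illustration. With a file as badly built as this one, the result would still need manual cleanup.

```python
import csv
import io
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run pdftotext with -layout and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def layout_text_to_rows(text):
    """Split layout-preserved text into rows of cells.

    Assumption: two or more spaces mark a column boundary. Leading
    whitespace is stripped, so empty leading columns are not preserved.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rows.append([cell.strip() for cell in re.split(r"\s{2,}", line.strip())])
    return rows


def rows_to_csv(rows):
    """Serialize rows to a CSV string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical layout text standing in for real pdftotext output.
sample = (
    "Product      Qty   Gross     Total\n"
    "\n"
    "LAN Cable    1     899.00    899.00\n"
)
print(rows_to_csv(layout_text_to_rows(sample)))
```

For the real file, the text would come from pdf_to_layout_text('Rectangle_Box_PDF_2021_v2.pdf') instead of the sample string, and the CSV string could be written to disk with an ordinary file write.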
