I want to extract the data inside a rectangular box in a PDF file into a CSV file with the corresponding rows and columns. I tried Camelot, PyPDF2, Tabula and other libraries, but I couldn't get the desired output in a CSV file. Could anyone help me here?

The data sits inside a rectangular box in the PDF file. The input PDF is linked below the code.

Below is the code I have tried:

import PyPDF2

# Open the PDF and read the first page. This uses the PyPDF2 >= 2.0 names;
# PdfFileReader, numPages, getPage and extractText are the older, deprecated ones.
with open('Rectangle_Box_PDF_2021_v2.pdf', 'rb') as pdf_file_obj:
    pdf_read = PyPDF2.PdfReader(pdf_file_obj)
    print("The total number of pages : " + str(len(pdf_read.pages)))
    page_obj = pdf_read.pages[0]
    page_text = page_obj.extract_text()

# Split the page text into one list entry per line.
flatList = page_text.split('\n')
print(flatList)

The PDF file link: https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view?usp=share_link

Mech_Saran
  • Is the data text or an image? – The Singularity Nov 04 '22 at 02:26
  • The data is text only. It is inside a rectangular box in the PDF file. Link to the input PDF file: https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view – Mech_Saran Nov 04 '22 at 02:27
  • If you want to extract the data by locating the rectangle itself, you'll have to use image processing. I'd suggest you instead try extracting the data by converting the entire PDF to a CSV or TXT file. – The Singularity Nov 04 '22 at 02:29
  • I tried my best but couldn't do it. Could you please help me here? – Mech_Saran Nov 04 '22 at 02:31
  • What have you tried with PyPDF2? – The Singularity Nov 04 '22 at 02:33
  • Below is the code I tried with PyPDF2. I managed to extract the data inside the rectangular box into a list. After that, I couldn't figure out how to write this data to a CSV file with the respective rows and columns. – Mech_Saran Nov 04 '22 at 02:39
  • `import PyPDF2 pdf_file_obj = open('Rectangle_Box_PDF_2021_v2.pdf', 'rb') pdf_read = PyPDF2.PdfFileReader(pdf_file_obj) print("The total number of pages : " +str(pdf_read.numPages)) page_obj = pdf_read.getPage(0) cont = [] pdf_list = [page_obj.extractText()] print(pdf_list) list1 = [] pdf_list = [page_obj.extractText()] for i in range(0, len(pdf_list)): list1.append(pdf_list[i].split('\n')) flatList = sum(list1, []) print(flatList)` – Mech_Saran Nov 04 '22 at 02:39
  • Post the code in your question. – The Singularity Nov 04 '22 at 02:40
  • Could you help me? I am desperately trying and waiting for a good solution. – Mech_Saran Nov 04 '22 at 02:58
  • Try this link: https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view?usp=share_link – Mech_Saran Nov 04 '22 at 03:14
  • Could you help me? I am desperately trying and waiting for a good solution. – Mech_Saran Nov 06 '22 at 19:45

1 Answer

This really is such a poor-quality file, and the data is so badly built and error-ridden, that any means of handling it needs human keyboard input.

The QR code was refused by several readers. Its payload is:

upi://pay?cu=INR
&pa=flipkartinternet@hsbc
&pn=MAHENDRA MAURYA
&gstIn=07AHTPM2207K1Z2
&am=0
&tr=OD124716518958108000
&tn=payOD124716518958108000
&invoiceNo=PZT2204190054Y22YD01
&InvoiceDate=2022-04-19T00:55:16+05:30
&invoiceValue=899.000
&transactionMethod=FLIPKART_FINANCE
&gstBrkUp={GST:96.320|CGST:0|SGST:0|IGST:96.320}
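
Since the QR payload already carries the invoice fields in a structured form, one option is to read them from there. The following is only a sketch, assuming the payload is exactly the string shown above; the helper names `parse_upi_payload` and `parse_gst_breakup` are made up for illustration.

```python
# Payload as shown above, joined back into a single string.
payload = (
    "upi://pay?cu=INR"
    "&pa=flipkartinternet@hsbc"
    "&pn=MAHENDRA MAURYA"
    "&gstIn=07AHTPM2207K1Z2"
    "&am=0"
    "&tr=OD124716518958108000"
    "&tn=payOD124716518958108000"
    "&invoiceNo=PZT2204190054Y22YD01"
    "&InvoiceDate=2022-04-19T00:55:16+05:30"
    "&invoiceValue=899.000"
    "&transactionMethod=FLIPKART_FINANCE"
    "&gstBrkUp={GST:96.320|CGST:0|SGST:0|IGST:96.320}"
)


def parse_upi_payload(uri):
    """Return the query parameters of a upi:// URI as a dict of raw strings."""
    _, _, query = uri.partition("?")
    # Split by hand: urllib.parse.parse_qs would turn the '+' in the
    # timestamp's UTC offset into a space.
    return dict(pair.split("=", 1) for pair in query.split("&") if "=" in pair)


def parse_gst_breakup(value):
    """Turn '{GST:96.320|CGST:0|...}' into a dict of floats keyed by tax name."""
    inner = value.strip("{}")
    return {
        name: float(amount)
        for name, amount in (item.split(":", 1) for item in inner.split("|"))
    }


fields = parse_upi_payload(payload)
gst = parse_gst_breakup(fields["gstBrkUp"])
print(fields["invoiceNo"], fields["invoiceValue"], gst["IGST"])
```

This only recovers the invoice totals, not the line items in the table, so it complements the text extraction instead of replacing it.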

Here it is exported to XLSX with Aspose. No slight on Aspose (rubbish in, rubbish out), but it is not a good result for exporting to CSV.

The best you may expect as plain text (with one comma per line) will be:

Product
Description
Qty
Gross 

Amount
Discount
Taxable 

Value
IGST
Total

Sadow 40 Meters CAT 6 Ethernet Cable Lan
Network CAT6 Internet Modem RJ45 Patch Cord
40 m LAN Cable Grey | sadow Grey cat6 40mtr |
IMEI/SrNo: [[]]

HSN: 85177090 | IGST: 12%
1
899.00
-0.00
802.68
96.32
899.00

Shipping and Handling
Charges
1
0.00
0
0.00
0.00
0.00

TOTAL QTY: 1
TOTAL PRICE: 899.00 

All values are in INR

u
or ze
gna ure
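
If you only have one-value-per-line text like the above, a rough way to get rows is to collect description lines until six numeric values in a row have been seen (Qty, Gross Amount, Discount, Taxable Value, IGST, Total). This is a heuristic sketch, not a general solution: `group_rows` is a made-up helper, the sample lines are shortened from the output above, and Product and Description are merged into a single hypothetical "Item" column because plain text gives no way to tell them apart.

```python
import csv
import io
import re

# A line counts as a numeric value if it looks like 1, 0, 899.00 or -0.00.
NUMERIC = re.compile(r"^-?\d+(\.\d+)?$")


def group_rows(lines, n_values=6):
    """Group description lines with the n_values numeric lines that follow them."""
    rows, text, values = [], [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no data
        if NUMERIC.match(line):
            values.append(line)
            if len(values) == n_values:
                rows.append([" ".join(text)] + values)
                text, values = [], []
        else:
            # Text after an incomplete run of numbers: treat those numbers
            # as part of the description and keep collecting.
            text.extend(values)
            values = []
            text.append(line)
    return rows


sample = [
    "Sadow 40 Meters CAT 6 Ethernet Cable Lan",
    "IMEI/SrNo: [[]]",
    "",
    "HSN: 85177090 | IGST: 12%",
    "1", "899.00", "-0.00", "802.68", "96.32", "899.00",
    "",
    "Shipping and Handling",
    "Charges",
    "1", "0.00", "0", "0.00", "0.00", "0.00",
]

rows = group_rows(sample)
buffer = io.StringIO()
writer = csv.writer(buffer)
# "Item" is an assumed column name for the merged Product/Description text.
writer.writerow(["Item", "Qty", "Gross Amount", "Discount", "Taxable Value", "IGST", "Total"])
writer.writerows(rows)
print(buffer.getvalue())
```

The header and footer lines (column titles, TOTAL QTY and so on) would need to be trimmed off before calling `group_rows`, which is part of the manual input mentioned earlier.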

A better bet is possibly to export to HTML, then either extract the data from that or import it into a spreadsheet for CSV export.

Here is the source PDF exported to XLSX using Adobe.

However, best of all was exporting via xpdf pdftotext and importing into Excel to save as CSV:

Product,Description,Qty,Gross,Discount,Taxable,IGST,Total
,,,Amount,,Value,,

Sadow 40 Meters CAT6 Ethernet Cable Lan,,,,,,,
Network CAT6 Internet Modem RJ45 Patch Cord,HSN: 85177090 | IGST: 12%,1,899.00,-0.00,802.68,96.32,899.00
40 m LAN Cable Grey | sadow Grey cat6 40mtr |,,,,,,,
IMEI/SrNo: [[]],,,,,,,
,Shipping and Handling,1,0.00,0,0.00,0.00,0.00
,Charges,,,,,,
TOTAL QTY: 1,,,,,,TOTAL,PRICE: 899.00
,,,,,,,All values are in INR
,,,,,,u orze,gna ure
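
The pdftotext route can also be scripted without going through Excel. The sketch below assumes the `pdftotext` command-line tool is installed and on the PATH and that its `-layout` mode keeps columns separated by runs of spaces. The function names and the two-space gap threshold are my own choices, not something tested against this particular file.

```python
import csv
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return the extracted text from stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def layout_text_to_rows(text, min_gap=2):
    """Split each non-blank line into cells wherever min_gap or more spaces occur."""
    gap = re.compile(r" {%d,}" % min_gap)
    return [gap.split(line.strip()) for line in text.splitlines() if line.strip()]


def write_csv(rows, csv_path):
    """Write the rows to csv_path, letting the csv module handle quoting."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(rows)
```

Calling `write_csv(layout_text_to_rows(pdf_to_layout_text("Rectangle_Box_PDF_2021_v2.pdf")), "invoice.csv")` would produce a first draft. Because leading spaces are stripped, rows whose first columns are empty will shift left, so expect to fix the alignment by hand much as in the output above.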

K J