I am trying to parse a PDF into a DataFrame using camelot:

import camelot
import pandas as pd

file = 'foo.pdf'
tables = camelot.read_pdf(file, pages='2', flavor='stream')

# Collect the DataFrame of every table camelot detected, then stack them
v = [table.df for table in tables]
w = pd.concat(v)

print(w)


However, it is read as below:

7                    Customer No.                           Document Date     Customer PO No.  External Doc. No.\nPayment Terms              
8                          126207                                28/02/22                                      STRICTLY 14 DAYS              
9                                                                               PO No./Docket         Unit Price \nAmount \nGST  Amount Incl.
10                    Description                                                   TASK DATE                      Quantity UOM              
11                                                                                        No.      Excl. GST\nExcl. GST\nAmount           GST
12                 BOC GAS & GEAR                                                                                                            
13                 11 SNOW STREET                                                                                                            
14       SOUTH LISMORE, NSW  2480                                                                                                            
15  CLEAR: FL 1.5M3 BIN-CARDBOARD                                                    02/02/22           1\nEA\n9.18\n9.18\n0.92         10.10
16  CLEAR: FL 1.5M3 BIN-CARDBOARD                                                    16/02/22           1\nEA\n9.18\n9.18\n0.92         10.10

How do I avoid the newline characters (\n) when reading the PDF?
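One possible post-processing workaround, sketched here under assumptions rather than as a confirmed fix: in the output above, each "\n" seems to mark a place where several column values (quantity, UOM, unit price, amounts) were merged into one cell, so splitting those cells may be more useful than just deleting the newlines. The helper name `split_newline_cells` is hypothetical, and it works on plain lists of strings so it can be tried without a PDF. Camelot's stream flavor also documents `columns` and `split_text` options for `read_pdf` that might avoid the merge at extraction time, but the column coordinates depend on the PDF, so they are not shown.

```python
def split_newline_cells(rows):
    # Hypothetical helper: expand every cell that contains a newline
    # into several cells, preserving the left-to-right order.
    out = []
    for row in rows:
        new_row = []
        for cell in row:
            # split("\n") yields a one-element list when there is no newline,
            # so ordinary cells pass through unchanged (apart from stripping).
            new_row.extend(part.strip() for part in str(cell).split("\n"))
        out.append(new_row)
    return out


# One row copied from the output in the question (empty cells omitted).
rows = [
    ["CLEAR: FL 1.5M3 BIN-CARDBOARD", "02/02/22",
     "1\nEA\n9.18\n9.18\n0.92", "10.10"],
]
print(split_newline_cells(rows))
# ['CLEAR: FL 1.5M3 BIN-CARDBOARD', '02/02/22', '1', 'EA', '9.18', '9.18', '0.92', '10.10'] (inside an outer list)
```

With camelot, the input could presumably be obtained via `table.df.values.tolist()`. Rows may end up with different lengths after splitting, so header and address rows would likely need separate handling before rebuilding a DataFrame.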

leonardo
  • True. How can I identify the frames? Can `camelot` help here? Or can a CNN help? I just don't know how I can train a CNN for this job. – leonardo May 21 '22 at 21:11
  • Aha, I guess the best way to go about it would be to work row-wise? E.g. take the first x rows and treat them with one set of rules, and then the next set with other rules? – leonardo May 21 '22 at 22:10
  • How can I process the same? I cannot break a PDF into rows, and neither can I make Camelot scan regions separately (if I am not wrong). – leonardo May 22 '22 at 04:29
