2

I want to convert PDF file into CSV or XLS. I tried doing this by using python tabula:

#!/bin/bash
#!/usr/bin/env python3
import tabula

# Read pdf into list of DataFrame
df = tabula.read_pdf("File1.pdf", pages='all')

# convert PDF into CSV file
tabula.convert_into("File1.pdf", "File1.csv", output_format="csv", pages='all')

# convert all PDFs in a directory
#tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')

Although python script convert PDF to CSV, decimal is not correct.

e.g 1.25 shows up as 1.2 only.

So I want to increase the decimal place to TWO so that I will get correct numbers in the converted CSV file.

Can someone please help me with it?

Thank you.

linux01
  • 41
  • 2
  • 7
  • need to explore area and columns parameters. something like this `df = tabula.read_pdf(pdf_file, pages='all',area=(0, 8, 800, 840),columns=[91,269,380,470,520,580,657]`. Again, it depends on pdf. if you could share the PDF, then we can take a look. – simpleApp Oct 20 '21 at 12:56
  • I have 12 columns in my PDF file. 1st four rows of the file are headings. – linux01 Oct 21 '21 at 14:31
  • Now I am able to get the decimal upto 2 but the format of csv file is weird. – linux01 Oct 22 '21 at 00:00
  • Now I am able to get the decimal upto 2 but the format of csv file is weird. The PDF file has 2 pages. The first page has 1st 3 lines included headers and 4th line is respective heading of 12 columns. The Second page does not have any headings for the columns. The converted CSV shows column1, column2 3 4 5 6 7 8 9, column10, column11, column12. Whereas 2nd page shows up correctly like column1, column2 . . . column12. In the 1st page column2 to 9 are showing up as single column. Please suggest how can I correct this ? – linux01 Oct 22 '21 at 00:09
  • pls share the PDF and also update the code you have so far. – simpleApp Oct 22 '21 at 00:34
  • ```#!/usr/bin/env python3 import tabula # Read pdf into list of DataFrame #df = tabula.read_pdf("21HJASP017.pdf", pages='all') #df = tabula.read_pdf("21HJASP017.pdf", pages='all',area=(0, 8, 800, 840),columns=[91,269,380,470,520,580,657]) df = tabula.read_pdf("21HJASP017.pdf", pages='all')[0] # convert PDF into CSV file tabula.convert_into("21HJASP017.pdf", "21HJASP017_22thOct2021_01.csv", output_format="csv", pages='all') # convert all PDFs in a directory #tabula.convert_into_by_batch("input_directory", output_format='csv', pages='all')``` – linux01 Oct 22 '21 at 00:55
  • @simpleApp Please find the PDf file below: https://drive.google.com/file/d/1a-yLIUNER6x892g6Iu2g6_a2l3ICfK_F/view?usp=sharing Code I have mentioned in my previous comment. Please help – linux01 Oct 24 '21 at 12:29

1 Answers1

2

Based on the need, we need to tweak the parameters on tabula so that data import makes sense. The parameters I suggested in the comments were just an example. To get columns starting x-axis either we need to use acrobat's paid version or with some trails.

so code would be like

Import and setup

import tabula
import pandas as pd
pdf_file='file1.pdf'
column_names=['Product','Batch No','Machin No','Time','Date','Drum/Bag No','Tare Wt.kg','Gross Wt.kg',
              'Net Wt.kg','Blender','Remarks','Operator']
df_results=[] # store results in a list

as pages are not in the same format, we need to process them separately. And some clean up like, remove the column which is not needed or data after certain value(refer in page 2 processing)

# Page 1 processing
try:
    df1 = tabula.read_pdf(pdf_file, pages=1,area=(95,20, 800, 840),columns=[93,180,220,252,310,315,333,367,
                                                                          410,450,480,520]
                         ,pandas_options={'header': None}) #(top,left,bottom,right)
    df1[0]=df1[0].drop(columns=5)
    df1[0].columns=column_names
    df_results.append(df1[0])
    df1[0].head(2)
    
except Exception as e:
    print(f"Exception page not found {e}")
# Page 2 processing
try:
    df2 = tabula.read_pdf(pdf_file, pages=3,area=(10,20, 800, 840),columns=[93,180,220,252,310,315,330,370,
                                                                          410,450,480,520]
                         ,pandas_options={'header': None}) #(top,left,bottom,right)

    row_with_Sta = df2[0][df2[0][0] == 'Sta'].index.tolist()[0]
    df2[0] = df2[0].iloc[:row_with_Sta]
    df2[0]=df2[0].drop(columns=5)
    df2[0].columns=column_names
    df_results.append(df2[0])
    df2[0].head(2)
except Exception as e:
    print(f"Exception page not found {e}")
#result = pd.concat([df1[0],df2[0]]) # concate both the pages and then write to CSV
result = pd.concat(df_results) # concate list of pages and then write to CSV
result.to_csv("result.csv")

Note: pls test the code, as I have some level of verification only :)

simpleApp
  • 2,885
  • 2
  • 10
  • 19
  • Yes now the final csv looks good however it still needs to be tunes for below mentioned points: A) I want to retain the original look of PDF file with respect to heading in the result.csv B) Need to eliminate serial number column C) if the number is 9.30 its displayed as 9.3, I want it to be displayed with 2 decimal numbers 9.30 like that How can we move ahead on this – linux01 Oct 26 '21 at 10:10
  • I have created bash shell script to put all these together such that when ever there is PDf file in teh directoy it should convert it to csv ```sed -e "s/Input_PDF_FileName/$pdffilename/" -e "s/Output_CSV_FileName/$filename/g" -e "s/final_outputfile/$filename/" /root/scripts/pdf2xls/python_pdf2xls.py /usr/bin/python3 /root/scripts/pdf2xls/python_pdf2xls.py ``` – linux01 Oct 27 '21 at 14:02
  • Traceback (most recent call last): File "/root/scripts/pdf2xls/python_pdf2xls.py", line 14, in ,pandas_options={'header': None}) #(top,left,bottom,right) File "/usr/local/lib/python3.7/dist-packages/tabula/io.py", line 314, in read_pdf raise FileNotFoundError(errno.ENOENT, os.strerror(errno.ENOENT), path) FileNotFoundError: [Errno 2] No such file or directory: '/pdf2xls/Input_PDF_FileName' – linux01 Oct 27 '21 at 14:04
  • why is it not accepting the replaced file path in python script – linux01 Oct 27 '21 at 14:05
  • pls post this as a new question. – simpleApp Oct 27 '21 at 14:22
  • When I am executing this python script in Bash shell Script, I am getting below error: ``` Traceback (most recent call last): File "/root/scripts/pdf2xls/python_pdf2xls.py", line 36, in csv_output.writerow(next(csv_input)) # write header StopIteration ``` – linux01 Oct 28 '21 at 04:59
  • I am now getting below error when I run the python script which you gave me: ` Error from tabula-java: Exception in thread "main" java.lang.IndexOutOfBoundsException: Page number does not exist.` – linux01 Oct 28 '21 at 10:51
  • I have checked Java: # java -version openjdk version "11.0.12" 2021-07-20 OpenJDK Runtime Environment (build 11.0.12+7-post-Raspbian-2deb10u1) OpenJDK Server VM (build 11.0.12+7-post-Raspbian-2deb10u1, mixed mode) Python tabula-py is also installed how to fix this error please help – linux01 Oct 28 '21 at 11:15
  • I did reinstalled everything still the error remains same. Please guide me how can I fix it – linux01 Oct 28 '21 at 14:29
  • pls, try to zero down on the issue. try to run python script only if that works well then the issue is with bash script. Maybe it's not reading the PDF file and that is the reason that getting Page number does not exist error. – simpleApp Oct 28 '21 at 17:11
  • yes because the PDF file were having either single page or more than two pages, the python was not able to read those pages and hence throwing error. How can we modify this script so that it can support single or multiple pages in the PDF to be converted to CSV. – linux01 Oct 29 '21 at 10:30
  • please suggest how can I move ahead on this? I am very much new to python. Please guide – linux01 Nov 01 '21 at 17:11
  • pls try [multiple pages](https://stackoverflow.com/questions/42538292/extracting-tables-from-pdfs-using-tabula) – simpleApp Nov 01 '21 at 18:12
  • I am unable to put together the condition for the script mentioned above. Can you please help? I am naive in python. – linux01 Nov 09 '21 at 12:46
  • I would recommend that pls go through the python tutorial on error handling .made some changes to the answer. – simpleApp Nov 09 '21 at 14:00
  • I am still not able to find out what to deal with the issue of multiple PDF pages. please guide me – linux01 Nov 18 '21 at 10:31
  • how can we expand 11th column boundaries so that it can accommodate words like "Report" and any Name with more characters. Because right now if 11th column has "Report" word, the output CSV has 11th column as "Repor" and T is included in 12th column. – linux01 Dec 31 '21 at 05:44