0

I am trying to use tabula-py to extract data from a PDf and save it to a csv. The PDF contains a work order. The data in the PDF is not formatted in a usable table - I am required to use Stream mode.

Through the Tabula web interface, I have created a template and can use it to extract the data to a csv.
My template includes 8 sections.
The csv export from Tabula looks like this: (without a header row though...)
| job number| | ---- | | full name | | phone1 | | phone2 | | address-line-1 | | address-line-2 | | email | | jobtype1 | | jobtype2 |

Using tabula-py, I can run the code df = tabula.read_pdf_with_template("file.pdf","template.tabula-template.json")
The output of print(df) is:

[Empty DataFrame
Columns: [full name]
Index: [], Empty DataFrame
Columns: [phone1]
Index: [], Empty DataFrame
Columns: [phone2]
Index: [], Empty DataFrame
Columns: [email]
Index: [],   address-line-1
0   address-line-2
Columns: [jobtype]
Index: [], Empty DataFrame
Columns: [jobtype2]
Index: [], Empty DataFrame
Columns: [jobnumber]
Index: []]

Note: I have changed the output text to the field name

None of the sections that I have selected have a header. Notice the address section shows the address line 1 as the header and address line 2 as the value. This is due to the address being on 2 lines in the PDF.

I have 3 questions:

  1. How do I get the address to be on one line?
  2. How do I export the output to csv? I know I have to loop through the DataFrame(s) to output one at a time, but python is not my strong suit
  3. I have over 3000 work orders to process. Is it possible to have the extract appended to the same csv file?

All code used to produce the above results:

import tabula
import pandas as pd
df = tabula.read_pdf_with_template("file.pdf","template.tabula-template.json")
print(df)

0 Answers0