I am trying to use tabula-py to extract data from a PDf and save it to a csv. The PDF contains a work order. The data in the PDF is not formatted in a usable table - I am required to use Stream mode.
Through the Tabula web interface, I have created a template and can use it to extract the data to a csv.
My template includes 8 sections.
The csv export from Tabula looks like this: (without a header row though...)
| job number|
| ---- |
| full name |
| phone1 |
| phone2 |
| address-line-1 |
| address-line-2 |
| email |
| jobtype1 |
| jobtype2 |
Using tabula-py, I can run the code df = tabula.read_pdf_with_template("file.pdf","template.tabula-template.json")
The output of print(df)
is:
[Empty DataFrame
Columns: [full name]
Index: [], Empty DataFrame
Columns: [phone1]
Index: [], Empty DataFrame
Columns: [phone2]
Index: [], Empty DataFrame
Columns: [email]
Index: [], address-line-1
0 address-line-2
Columns: [jobtype]
Index: [], Empty DataFrame
Columns: [jobtype2]
Index: [], Empty DataFrame
Columns: [jobnumber]
Index: []]
Note: I have changed the output text to the field name
None of the sections that I have selected have a header. Notice the address section shows the address line 1 as the header and address line 2 as the value. This is due to the address being on 2 lines in the PDF.
I have 3 questions:
- How do I get the address to be on one line?
- How do I export the output to csv? I know I have to loop through the DataFrame(s) to output one at a time, but python is not my strong suit
- I have over 3000 work orders to process. Is it possible to have the extract appended to the same csv file?
All code used to produce the above results:
import tabula
import pandas as pd
df = tabula.read_pdf_with_template("file.pdf","template.tabula-template.json")
print(df)