I am using Python 3.8.1 and tabula-py 2.1.0 (https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.build_options) to extract tables from a text-based PDF file (a monthly AWS billing report).
A sample of the PDF file is shown below (bottom of the 1st page and top of the 2nd page).
The Python script is shown below:
from tabula import read_pdf
from tabulate import tabulate

df = read_pdf(
    "my_report.pdf",
    output_format="dataframe",
    multiple_tables=True,
    pages="all",
    silent=True,
    # TODO: area=(top, left, bottom, right)  # ?
)
print(tabulate(df))
This generates the following output:
--- --------------------------------------------------------------------------- --------------------- ---------
0 region nan nan
1 AWS CloudTrail APS2-PaidEventsRecorded nan $3.70
2 0.00002 per paid event recorded in Asia Pacific (Sydney) 184,961.000 Events $3.70
3 region nan nan
4 Asia Pacific (Tokyo) nan $3.20
My guess is that the area option has to be set properly, since the topmost and leftmost data are sometimes omitted. Is this the case, and if so, how do you find the correct area of all tabular data within the PDF file?
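For reference, here is a minimal sketch of what I think the call would look like with an explicit area. It assumes that tabula-py expects area in the order (top, left, bottom, right), that relative_area=True makes those values percentages of the page instead of points, and that guess=False and stream=True are appropriate for this report. The helper names full_page_area and read_tables are hypothetical, and I have not confirmed that this fixes the missing rows.

```python
def full_page_area(margin_pct=0.0):
    """Return (top, left, bottom, right) as percentages of the page.

    margin_pct trims the same percentage from every edge; 0 means the
    whole page. The ordering follows what I understand tabula's area
    option to expect (an assumption on my part).
    """
    if not 0 <= margin_pct < 50:
        raise ValueError("margin_pct must be in the range [0, 50)")
    return (margin_pct, margin_pct, 100 - margin_pct, 100 - margin_pct)


def read_tables(path, area):
    """Read every page of `path`, restricted to `area` (hypothetical helper)."""
    # Imported here so the helper above can be used without tabula/Java.
    from tabula import read_pdf

    return read_pdf(
        path,
        pages="all",
        multiple_tables=True,
        area=area,
        relative_area=True,  # interpret area as percentages, not points
        guess=False,         # do not let tabula guess the table region
        stream=True,         # assumption: the tables have no ruling lines
        silent=True,
    )
```

Calling read_tables("my_report.pdf", full_page_area()) would then scan each page in full, and increasing margin_pct would narrow the region if headers or footers get picked up.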
Thanks in advance.