
I am using Python (3.8.1) and tabula-py (2.1.0) (https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.build_options) to extract tables from a text-based PDF file (a monthly AWS billing report).

A sample of the PDF file is shown below (bottom of the 1st page and top of the 2nd page).

[Image: PDF sample]


The Python script is shown below:

from tabula import read_pdf
from tabulate import tabulate

df = read_pdf(
   "my_report.pdf",
   output_format="dataframe",
   multiple_tables=True,
   pages="all",
   silent=True,
   # TODO: area=(top, left, bottom, right)  # which values to use?
)

print(tabulate(df))


Which generates the following output:

---  ---------------------------------------------------------------------------  ---------------------  ---------
  0  region                                                                       nan                    nan
  1  AWS CloudTrail APS2-PaidEventsRecorded                                       nan                    $3.70
  2  0.00002 per paid event recorded in Asia Pacific (Sydney)                     184,961.000 Events     $3.70
  3  region                                                                       nan                    nan
  4  Asia Pacific (Tokyo)                                                         nan                    $3.20

My guess is that the area option has to be set properly, since the topmost and leftmost data are sometimes omitted. Is this the case, and if so, how do you find the correct area covering all tabular data within the PDF file?

Thanks in advance.

Gustav Rasmussen
  • PostScript is a page-layout language. It preserves virtually nothing of the structure of the source document (divisions, chapters, and so forth). So identifying tables in a PDF is more of an art than a science. There is no `table` tag in PostScript. `tabula-py` needs to infer the existence of a table simply from the layout. And there is no easy way to discover from the PDF that the table on page 2 is a continuation of the table on page 1, unless it has repeated headings. Maybe you could report this as an issue on https://github.com/chezou/tabula-py/issues . – BoarGules Mar 27 '20 at 09:52
  • @BoarGules Good point. I will report this as an issue. Do you have any suggestions for a better way to extract the tabular data from this PDF into a dataframe, CSV format or similar? – Gustav Rasmussen Mar 27 '20 at 09:55

2 Answers


Try passing the parameter guess=False to read_pdf.
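For illustration, a minimal sketch of what that could look like. `guess` is a documented `read_pdf` option (default `True`) that lets tabula pick the region to analyse on each page. The helper names `no_guess_options` and `read_without_guessing` are made up for this example, and the other options are simply copied from the question.

```python
def no_guess_options(pages="all"):
    """Keyword arguments for tabula.read_pdf with region guessing switched off.

    Hypothetical helper, not part of tabula-py.
    """
    return {
        "pages": pages,
        "guess": False,           # do not let tabula guess the table region per page
        "multiple_tables": True,
        "silent": True,
    }


def read_without_guessing(pdf_path, pages="all"):
    """Read the tables of `pdf_path` with guess=False (hypothetical helper)."""
    # Imported here so the options helper can be used without tabula-py/Java installed.
    from tabula import read_pdf

    return read_pdf(pdf_path, **no_guess_options(pages))


# Example call, assuming the file from the question exists:
# dfs = read_without_guessing("my_report.pdf")
```

Whether this recovers the missing rows depends on the layout of the PDF, so compare the result against the original pages.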

John Smith

I managed to solve this issue by enlarging the area of the page that is searched for data:

# get locations from page 2 data:
tables = read_pdf("my_report.pdf", output_format="json", pages=2, silent=True)
top = tables[0]["top"]
left = tables[0]["left"]
bottom = tables[0]["height"] + top
right = tables[0]["width"] + left
# Expand location borders slightly:
test_area = [top - 20, left - 20, bottom + 10, right + 10]

# Now read_pdf gives all data with the following call:

df = read_pdf(
   "my_report.pdf",
   multiple_tables=True,
   pages="all",
   silent=True,
   area = test_area
)
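
A possible generalisation, as a sketch only: instead of using just `tables[0]`, take the union of every table box reported in the JSON output and pad it, clamping at the page origin so the padded values never go negative. The helper names `padded_area` and `read_with_padded_area` are assumptions for this example, and the default margins mirror the ones above. This may also explain why passing `area` helps at all: as far as I can tell from the tabula-py documentation, `guess` defaults to `True` and is switched off once `area` is given.

```python
def padded_area(tables, pad_top=20, pad_left=20, pad_bottom=10, pad_right=10):
    """Return [top, left, bottom, right] covering all `tables`, plus padding.

    `tables` is the list returned by read_pdf(..., output_format="json");
    each entry is expected to carry "top", "left", "width" and "height".
    """
    if not tables:
        raise ValueError("no tables detected; cannot derive an area")
    top = min(t["top"] for t in tables)
    left = min(t["left"] for t in tables)
    bottom = max(t["top"] + t["height"] for t in tables)
    right = max(t["left"] + t["width"] for t in tables)
    # Clamp top/left at 0 so the padding cannot push the area off the page.
    return [
        max(0.0, top - pad_top),
        max(0.0, left - pad_left),
        bottom + pad_bottom,
        right + pad_right,
    ]


def read_with_padded_area(pdf_path, sample_page=2):
    """Derive the area from one sample page, then read all pages with it."""
    # Imported here so padded_area can be used without tabula-py/Java installed.
    from tabula import read_pdf

    boxes = read_pdf(pdf_path, output_format="json", pages=sample_page, silent=True)
    return read_pdf(
        pdf_path,
        multiple_tables=True,
        pages="all",
        silent=True,
        area=padded_area(boxes),
    )
```

The same area is applied to every page, so this only works if the tables sit in roughly the same place on each page of the report.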
Gustav Rasmussen
  • It surprised me that this worked, as the documentation for tabula-py (https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.build_options) states that the entire PDF page is searched by default. – Gustav Rasmussen Mar 27 '20 at 13:15