Tabula-py skips first page from PDF and misses some tabular data

Question

I am using Python (3.8.1) and tabula-py (2.1.0) (https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.build_options) to extract tables from a text-based PDF file (a monthly AWS billing report).

Below a sample of the PDF file is shown (bottom of 1st page and top of 2nd page).

The Python script is shown below:

from tabula import read_pdf
from tabulate import tabulate

df = read_pdf(
   "my_report.pdf",
   output_format="dataframe",
   multiple_tables=True,
   pages="all",
   silent=True,
   # TODO: area = (top, left, bottom, right) # ?
)

print(tabulate(df))

Which generates the following output:

---  ---------------------------------------------------------------------------  ---------------------  ---------
  0  region                                                                       nan                    nan
  1  AWS CloudTrail APS2-PaidEventsRecorded                                       nan                    $3.70
  2  0.00002 per paid event recorded in Asia Pacific (Sydney)                     184,961.000 Events     $3.70
  3  region                                                                       nan                    nan
  4  Asia Pacific (Tokyo)                                                         nan                    $3.20

My thought is that the area option has to be set properly, since the topmost and leftmost data are sometimes omitted. Is this the case, and if so, how do you find the correct area covering all tabular data within the PDF file?

Thanks in advance.

PostScript is a page-layout language. It preserves virtually nothing of the structure of the source document (divisions, chapters, and so forth). So identifying tables in a PDF is more of an art than a science. There is no `table` tag in PostScript. `tabula-py` needs to infer the existence of a table simply from the layout. And there is no easy way to discover from the PDF that the table on page 2 is a continuation of the table on page 1, unless it has repeated headings. Maybe you could report this as an issue on https://github.com/chezou/tabula-py/issues . — BoarGules, Mar 27 '20 at 09:52
@BoarGules Good point. I will report this issue. Do you have any suggestion for a better way to extract the tabular data from this PDF into a dataframe, CSV format, or similar? — Gustav Rasmussen, Mar 27 '20 at 09:55
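
On the CSV question in the comment above, one possible route is to ask tabula-py for JSON output and write the rows out with the standard csv module. The sketch below is not from the thread. It assumes each table in the JSON result carries a "data" list of rows whose cells are dicts with a "text" key; the helper names table_rows and rows_to_csv are made up for illustration, and sample_tables is a hand-written stand-in for real read_pdf output.

```python
import csv
import io


def table_rows(tables):
    """Yield each row as a list of cell strings, across all tables in order.

    Assumption: every table is a dict with a "data" list of rows, and every
    row is a list of cell dicts with a "text" key.
    """
    for table in tables:
        for row in table.get("data", []):
            yield [cell.get("text", "") for cell in row]


def rows_to_csv(tables):
    """Serialize the rows of all tables into one CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(table_rows(tables))
    return buffer.getvalue()


# Hand-written stand-in for read_pdf(..., output_format="json") output.
sample_tables = [
    {"data": [[{"text": "Asia Pacific (Sydney)"}, {"text": "$3.70"}]]},
    {"data": [[{"text": "Asia Pacific (Tokyo)"}, {"text": "$3.20"}]]},
]

print(rows_to_csv(sample_tables))
```

Because the rows from every table are written one after another, a table that continues on the next page ends up in the same CSV, although repeated header rows would still need to be filtered out by hand.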

score 3 · Answer 1 · answered Sep 21 '20 at 20:41


Try passing the parameter guess=False to read_pdf.
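
A minimal sketch of what that could look like, reusing the options from the question. As far as I understand the tabula-py documentation, guess controls whether tabula guesses the portion of each page to analyze, so turning it off should make it consider the whole page. The names READ_OPTIONS and read_without_guessing are only illustrative, and whether this recovers the missing rows for this particular PDF is untested.

```python
# Options mirror the call in the question, plus guess=False.
READ_OPTIONS = {
    "pages": "all",
    "multiple_tables": True,
    "silent": True,
    "guess": False,  # do not let tabula guess the table region on each page
}


def read_without_guessing(pdf_path):
    """Read every page of pdf_path with region guessing turned off."""
    # Imported lazily so the options can be inspected without tabula or Java.
    from tabula import read_pdf

    return read_pdf(pdf_path, **READ_OPTIONS)


# Usage (requires tabula-py and a Java runtime):
# tables = read_without_guessing("my_report.pdf")
```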


John Smith


score 0 · Accepted Answer · answered Mar 27 '20 at 13:14

I managed to solve this issue by extending the location of the data being searched:

from tabula import read_pdf

# Get the bounding box of the first table detected on page 2:
tables = read_pdf("my_report.pdf", output_format="json", pages=2, silent=True)
top = tables[0]["top"]
left = tables[0]["left"]
bottom = tables[0]["height"] + top
right = tables[0]["width"] + left
# Expand the borders slightly; area is given as [top, left, bottom, right]:
test_area = [top - 20, left - 20, bottom + 10, right + 10]

# Now read_pdf gives all data with the following call:

df = read_pdf(
   "my_report.pdf",
   multiple_tables=True,
   pages="all",
   silent=True,
   area=test_area,
)
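
The border arithmetic above can be wrapped in a small helper so that it is easy to reuse and check without a PDF at hand. This is only a sketch: the function name expanded_area and its margin parameters are hypothetical, and clamping the top and left edges at zero is an addition of mine (so a table near the page edge does not produce a negative coordinate), not something the original code did.

```python
def expanded_area(table, margin_top_left=20.0, margin_bottom_right=10.0):
    """Return [top, left, bottom, right] grown by the given margins.

    `table` is one entry of the JSON output and is expected to have the
    "top", "left", "height" and "width" keys used in the answer above.
    Top and left are clamped at 0.0 (an assumption, see the lead-in).
    """
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    return [
        max(0.0, top - margin_top_left),
        max(0.0, left - margin_top_left),
        bottom + margin_bottom_right,
        right + margin_bottom_right,
    ]


# Made-up bounding box, standing in for tables[0] from the JSON output.
detected = {"top": 15.0, "left": 40.0, "height": 500.0, "width": 480.0}
print(expanded_area(detected))  # [0.0, 20.0, 525.0, 530.0]
```

The result can then be passed as area=expanded_area(tables[0]) in the read_pdf call.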

It surprised me that this worked, as the documentation for tabula-py (https://tabula-py.readthedocs.io/en/latest/tabula.html#tabula.io.build_options) states that the entire PDF page is searched by default. — Gustav Rasmussen, Mar 27 '20 at 13:15
