I am trying to extract tables from a multi-page PDF with tabula-py, and while the tables on some of the pages of the PDF are extracted perfectly, some pages are omitted entirely.

The omissions seem to be random and don't follow any visible features of the PDF (each page looks the same). For example, tabula omitted page 1, extracted page 2, omitted pages 3 and 4, extracted page 5, omitted page 6, extracted pages 8 and 9, omitted 10, extracted 11, etc. I have macOS Sierra 10.12.6 and Python 3.6.3 :: Anaconda custom (64-bit).

I've tried splitting the PDF into shorter sections, even into one-pagers, but the omitted pages can't be extracted no matter what I try. I've read the related documentation and the filed issues on the tabula-py GitHub page as well as here on Stack Overflow, but I can't find a solution.

The code I use in IPython notebooks is as follows:

To install tabula through the terminal:

pip install tabula-py

To extract the tables in my PDF:

from tabula import read_pdf
df = read_pdf("document_name.pdf", pages="all")

I also tried the following, which didn't make any difference:

df = read_pdf("document_name.pdf", pages="1-361")

To save the data frame into csv:

df.to_csv('document_name.csv')

I'd be really thankful if you could help me with this, as I feel like I'm stuck with a PDF from which I've only managed to extract around 50% of data. This is infuriating, as the 50% looks absolutely perfect, but the other 50% seems out of my reach and renders the larger project of analyzing the data impossible.

I also wonder if this might be an issue with the PDF rather than with tabula. Could the file be mistakenly set as protected or locked? If so, does anyone know how I could check for that and unlock it?

Thanks a ton in advance!

Sannita
  • 361 pages of table(s)? Could you be running out of memory? – Yohst Jul 30 '18 at 00:01
  • Thanks for the comment @Yohst! I don't think memory is an issue here. The first page of the 361-page document was omitted, and I tried using the same code on that page alone (as a 1-page document). When I tried printing the data frame, I just got 'None', so I think tabula-py isn't working on this page, just like on some of the other pages that were omitted. Would you have any other suggestions @Yohst? – Sannita Jul 30 '18 at 00:30
  • I tried running your setup but got a Java error on read_pdf(). I take it tabula runs Java in the background and may need a specific configuration. I am on macOS, so I don't want to fiddle with that. Perhaps it's time to contact the authors of the lib? – Yohst Jul 30 '18 at 06:28
  • Thanks for taking another look @Yohst! I tried reinstalling java too, but already had the latest version. The author of the library @chezou asked all questions be directed to him through Stack Overflow, but I don't seem to have a way to notify him about my question, so just hoping for him to see this at some point. – Sannita Jul 30 '18 at 09:57

2 Answers

This could be because the area of your data in the PDF file exceeds the area that is being read by tabula. Try the following:

First get the location of your data, by parsing one of the pages into JSON format (here I chose page 2), then extract and print the locations:

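from tabula import read_pdf

# Each JSON entry describes one detected table, including its position on the page
tables = read_pdf("document_name.pdf", output_format="json", pages=2, silent=True)
top = tables[0]["top"]
left = tables[0]["left"]
bottom = tables[0]["height"] + top
right = tables[0]["width"] + left
# Plain f-string fields, since the f"{top=}" form requires Python 3.8+
print(f"top={top}\nbottom={bottom}\nleft={left}\nright={right}")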

You can now try to expand these locations slightly by experimentation, until you receive more data from the PDF document:

# area = [top, left, bottom, right]
# Example from page 2 json output: area = [30.0, 59.0, 761.0, 491.0]
# You could then nudge these locations slightly to include a wider data area:
test_area = [10.0, 30.0, 770.0, 500.0]

df = read_pdf(
    "document_name.pdf",
    multiple_tables=True,
    pages="all",
    area=test_area,
    silent=True,  # Suppress all stderr output
)

The df variable will now hold your tables with the PDF data. Because multiple_tables=True is set, it is a list of DataFrames, one per detected table.
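
If nudging the numbers by hand gets tedious, the same idea can be sketched as a small helper. This is only a sketch under assumptions: `expanded_area` and its `margin` parameter are hypothetical names (not part of tabula-py), and it assumes each JSON table entry carries the `top`, `left`, `width` and `height` keys used above.

```python
from typing import Dict, List, Sequence


def expanded_area(tables: Sequence[Dict[str, float]], margin: float = 10.0) -> List[float]:
    """Hypothetical helper: bounding box around all detected tables, padded by
    `margin` on every side, returned as [top, left, bottom, right]."""
    if not tables:
        raise ValueError("no tables detected, nothing to measure")
    top = min(t["top"] for t in tables)
    left = min(t["left"] for t in tables)
    bottom = max(t["top"] + t["height"] for t in tables)
    right = max(t["left"] + t["width"] for t in tables)
    # Clamp top and left at 0 so the padded area does not start off the page
    return [max(top - margin, 0.0), max(left - margin, 0.0), bottom + margin, right + margin]


# Same shape as the entries in tabula's JSON output (values are illustrative)
sample = [{"top": 30.0, "left": 59.0, "width": 432.0, "height": 731.0}]
print(expanded_area(sample))
```

The returned list can be passed as the area argument to read_pdf. The right margin still has to be found by experimentation.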

Gustav Rasmussen

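Try passing java_options to read_pdf to give the Java process more memory, for example: java_options="-Xmx4g" (this raises the JVM's maximum heap size to 4 GB).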
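
To see whether the larger heap changes which pages come back empty, something along these lines might help. It is a rough sketch, not a tested recipe: `find_empty_pages` and `make_tabula_reader` are hypothetical helper names, and the reader is passed in as a function so the page loop can be tried without Java installed.

```python
from typing import Callable, Iterable, List


def find_empty_pages(read_page: Callable[[int], object], pages: Iterable[int]) -> List[int]:
    """Hypothetical helper: return the page numbers for which `read_page`
    gives back nothing (None or an empty result)."""
    empty = []
    for page in pages:
        result = read_page(page)
        if result is None or len(result) == 0:
            empty.append(page)
    return empty


def make_tabula_reader(path: str, java_options: str = "-Xmx4g") -> Callable[[int], object]:
    """Build a one-page reader around tabula's read_pdf (imported lazily)."""
    def read_page(page: int):
        from tabula import read_pdf  # needs tabula-py and a working Java install
        return read_pdf(
            path,
            pages=page,
            multiple_tables=True,
            java_options=java_options,
            silent=True,
        )
    return read_page


# Usage against the real file (commented out because it needs Java):
# print(find_empty_pages(make_tabula_reader("document_name.pdf"), range(1, 362)))
```

If the list of empty pages is the same with and without the extra memory, the heap size is probably not the cause.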

chezou