I parsed three documents with tabula-py to extract tables. The results were as follows:

  1. Document 1: Parsed perfectly.
  2. Document 2: Logged "Jul 16, 2019 5:25:42 PM org.apache.pdfbox.pdmodel.font.PDType1Font WARNING: Using fallback font NimbusSanL-Bold for Univers-Bold". I'm not sure whether this is related, but the second page was parsed and the first one was not.
  3. Document 3: Logged "Jul 17, 2019 10:21:25 AM org.apache.pdfbox.pdmodel.font.PDType1Font WARNING: Using fallback font NimbusSanL-Regu for Univers". Nothing was parsed from this one.

These are the current tabula parsing settings:

    import tabula

    # filename is the path to the PDF being parsed
    rows = tabula.read_pdf(filename,
                           pages='all',
                           silent=True,
                           pandas_options={
                               'header': None,
                               'error_bad_lines': False,
                               'warn_bad_lines': False
                           })

Are there other settings that might solve this particular problem?

1 Answer

The warnings come from PDFBox, which tabula-java depends on. Unfortunately, the problem lies in the PDF files themselves, and there is no way to work around it from tabula-py.
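To see exactly which pages fail, a small diagnostic along these lines may help. This is only a sketch: `count_tables_per_mode` and its `reader` parameter are hypothetical names introduced here, and it assumes the `lattice`, `stream`, `multiple_tables` and `silent` options of `tabula.read_pdf`. It does not repair anything; it only reports how many tables each extraction mode finds on each page.

```python
def count_tables_per_mode(filename, pages, reader=None):
    """Return {(page, mode): table count or error text} for both modes.

    `reader` is a hypothetical hook for testing; by default it is
    tabula.read_pdf, which needs tabula-py and a Java runtime installed.
    """
    if reader is None:
        import tabula  # imported lazily so the helper loads without Java
        reader = tabula.read_pdf

    results = {}
    for page in pages:
        for mode in ("lattice", "stream"):
            try:
                # multiple_tables=True makes read_pdf return a list of DataFrames
                tables = reader(
                    filename,
                    pages=page,
                    multiple_tables=True,
                    silent=True,
                    **{mode: True},
                )
            except Exception as exc:
                # Record the failure instead of aborting the whole scan
                results[(page, mode)] = f"error: {exc}"
                continue
            results[(page, mode)] = len(tables)
    return results
```

Calling `count_tables_per_mode("document2.pdf", [1, 2])` (the file name is a placeholder) would show whether page 1 yields zero tables in both modes, which would be consistent with the problem being in the PDF rather than in the settings.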

chezou