
The PDF contains Chinese text (real characters, not images), so it may use different fonts. My code:

>>> import tabula
>>> df = tabula.read_pdf('/data/proj/smartinvestment/cninfo_download_reports/pdf/601101/2016-12-29/1202969937.PDF', pages='all')

The Error:

Feb 02, 2018 6:44:34 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2 <init>
INFO: OpenType Layout tables used in font ABCDEE+ËÎÌå are not implemented in PDFBox and will be ignored

The final DataFrame is empty.

I could not find anything helpful on Stack Overflow. How can I fix this? Do I need to install some fonts, or is something else causing the problem?

Mark

1 Answer


I feel your pain. However, I am getting data in my dataframe (df) doing similar steps to yours. To troubleshoot, look at the type of your df being returned:

import tabula

pdf_file_name = "my_filename.pdf"
df = tabula.read_pdf(pdf_file_name,
                     encoding='Ansi')  # 'Ansi' is a Windows-only codec alias; use encoding='utf-8' elsewhere

print(type(df))
# df.to_csv("output.csv", index=False)

Because you pass pages="all", it is quite possible that the returned df is actually a list of DataFrames. In that case you need to look at each DataFrame in the list to find your data.

Likewise, if the multiple_tables parameter of tabula.read_pdf is set to True, the result is a list of DataFrames, and you again need to inspect each one.
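As a rough sketch of that check, the helper below accepts whatever read_pdf returned, whether a single DataFrame or a list of them, and reports the shape of each table. The names as_table_list, summarize_tables, and inspect_pdf are made up for this illustration and are not part of tabula; the file name is a placeholder.

```python
def as_table_list(result):
    """Normalize the read_pdf result: a single DataFrame becomes a one-item list."""
    if result is None:
        return []
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


def summarize_tables(result):
    """Return (index, shape, is_empty) for every table in the result."""
    return [(i, t.shape, t.empty) for i, t in enumerate(as_table_list(result))]


def inspect_pdf(pdf_file_name):
    """Hypothetical helper: read the PDF and print one line per extracted table."""
    import tabula  # needs tabula-py and a Java runtime installed

    result = tabula.read_pdf(pdf_file_name, pages="all")
    print(type(result))
    for index, shape, is_empty in summarize_tables(result):
        print(f"table {index}: shape={shape}, empty={is_empty}")
```

Calling inspect_pdf("my_filename.pdf") should show whether any of the returned tables actually holds rows. If every table is reported as empty, the problem is more likely in the extraction itself than in how the result is being read.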

Thom Ives