
I am trying to extract a table from the following PDF file using tabula-py: link to pdf

However, I encounter the following error:

WARNING:tabula.io:Got stderr: Jan 17, 2023 1:28:52 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+18103 (18103) in font NRPIKV+MS-PGothic-90ms-RKSJ-H-Identity-H
Jan 17, 2023 1:28:52 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode

The error was produced when using the read_pdf() function:

import tabula
import pandas as pd

dfs = tabula.read_pdf('https://www.anahd.co.jp/group/pr/pdf/20230110.pdf?_gl=1*an9okz*_ga*NjIwMjE1NzkxLjE2NzM4NDk2NTQ.*_ga_32F297W9WL*MTY3MzkxOTA1Mi4yLjAuMTY3MzkxOTA1Mi4wLjAuMA..', lattice=True, pages=4)

which gives me an empty list.

There is a table on page 4, so I expected the list to contain a pandas DataFrame.

I have tried setting the lattice option to False, but it made no difference. I ran this both on a Mac and on Google Colab, with the same result.
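For reference, the attempts above can be compared in one call. This is only a minimal sketch: `count_tables_by_mode` and its `reader` parameter are hypothetical names introduced here, and using `stream=True` as the alternative to lattice mode is an assumption about what "lattice off" should be compared against. It relies on `tabula.read_pdf` returning a list of DataFrames.

```python
def count_tables_by_mode(path, page, reader=None):
    """Return {mode: number of tables found} for tabula's lattice and stream modes.

    `reader` is a hypothetical injection point for testing; by default it is
    tabula.read_pdf, which needs tabula-py and a Java runtime installed.
    """
    if reader is None:
        import tabula  # imported lazily so the helper can be tested without Java
        reader = tabula.read_pdf

    counts = {}
    for mode in ("lattice", "stream"):
        # read_pdf returns a list of DataFrames; an empty list means no table was detected
        tables = reader(path, pages=page, **{mode: True})
        counts[mode] = len(tables)
    return counts
```

Calling `count_tables_by_mode(pdf_path, 4)`, where `pdf_path` is the URL above or a local copy of the file, would report how many tables each mode detects. A count of 0 for a mode corresponds to the empty list described above.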

  • *"I encounter the following error"* - If you look more closely, it's a warning, not an error. It says that the PDF does not contain the information (at the PDF level) required for text extraction. – mkl Jan 17 '23 at 08:29
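To check that the stderr output consists of these font warnings rather than an actual failure, the captured log text can be tallied per font. The following is a minimal sketch using only the standard library. `summarize_unicode_warnings` is a hypothetical helper, not part of tabula-py or PDFBox, and the sample string is copied from the log quoted in the question.

```python
import re
from collections import Counter

# Matches PDFBox lines such as:
# "WARNING: No Unicode mapping for CID+18103 (18103) in font NRPIKV+MS-PGothic-..."
_MISSING_MAPPING = re.compile(
    r"No Unicode mapping for CID\+(\d+) \(\d+\) in font (\S+)"
)

def summarize_unicode_warnings(stderr_text):
    """Count 'No Unicode mapping' warnings per font name in captured stderr text."""
    counts = Counter()
    for match in _MISSING_MAPPING.finditer(stderr_text):
        counts[match.group(2)] += 1  # group(2) is the font name
    return dict(counts)

# Sample taken from the log in the question
sample = (
    "Jan 17, 2023 1:28:52 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode\n"
    "WARNING: No Unicode mapping for CID+18103 (18103) in font "
    "NRPIKV+MS-PGothic-90ms-RKSJ-H-Identity-H\n"
)
print(summarize_unicode_warnings(sample))
```

A non-empty result means the affected font's glyphs could not be mapped to Unicode text, which is consistent with the comment above that the PDF lacks the information needed for text extraction.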
