I am trying to extract a table from the following PDF file using tabula-py: link to pdf
However, I encounter the following error:
WARNING:tabula.io:Got stderr: Jan 17, 2023 1:28:52 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
WARNING: No Unicode mapping for CID+18103 (18103) in font NRPIKV+MS-PGothic-90ms-RKSJ-H-Identity-H
Jan 17, 2023 1:28:52 AM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
The error was produced when calling the read_pdf() function:
import tabula
import pandas as pd
dfs = tabula.read_pdf('https://www.anahd.co.jp/group/pr/pdf/20230110.pdf?_gl=1*an9okz*_ga*NjIwMjE1NzkxLjE2NzM4NDk2NTQ.*_ga_32F297W9WL*MTY3MzkxOTA1Mi4yLjAuMTY3MzkxOTA1Mi4wLjAuMA..', lattice=True, pages=['4'])
which gives me an empty list.
There is a table on page 4, so I expected the list to contain at least one pandas DataFrame.
I have tried setting the lattice option to False, but it made no difference.
I tried running this both on a Mac and on Google Colab, with the same result.
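To make the comparison between extraction modes easier to reproduce, here is a minimal sketch that calls the reader once per mode and records the shape of every table returned. The helper names table_shapes_by_mode and compare_with_tabula are hypothetical and only for illustration. The lattice, stream, guess and pages keywords are existing tabula.read_pdf parameters. This only reports what each mode returns; it is not a fix for the empty result.

```python
from typing import Any, Callable, Dict, List, Tuple


def table_shapes_by_mode(
    read_pdf: Callable[..., List[Any]],
    source: str,
    pages: Any,
) -> Dict[str, List[Tuple[int, int]]]:
    """Call read_pdf once per extraction mode and collect the table shapes.

    read_pdf is passed in as a parameter (hypothetical design choice) so the
    helper can be exercised without tabula-py or Java being installed.
    """
    # Keyword combinations to try; all three keywords exist in tabula.read_pdf.
    modes = {
        "lattice": {"lattice": True},
        "stream": {"stream": True},
        "stream_no_guess": {"stream": True, "guess": False},
    }
    shapes: Dict[str, List[Tuple[int, int]]] = {}
    for name, options in modes.items():
        tables = read_pdf(source, pages=pages, **options)
        # An empty list here means no table was detected in that mode.
        shapes[name] = [table.shape for table in tables]
    return shapes


def compare_with_tabula(source: str, pages: Any) -> Dict[str, List[Tuple[int, int]]]:
    """Run the comparison with the real tabula-py reader."""
    import tabula  # assumption: tabula-py and a Java runtime are installed

    return table_shapes_by_mode(tabula.read_pdf, source, pages)
```

Calling compare_with_tabula(url, 4) with the PDF URL above would show, per mode, whether any table was detected on page 4.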