
I am trying to extract data from a Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java the output isn't human-readable (definitely not Japanese characters), and there are no error or warning messages. It does seem that the content of the PDF is processed, though. [screenshot: unreadable tabula-java results]

When using the standalone Tabula tool, the characters are encoded properly: [screenshot: readable Tabula results]

Searching online and in the tabula-py and tabula-java documentation, below are the suggestions I could find, but they don't change the output:

  1. Setting the -Dfile.encoding=utf8 (in java call to tabula-py or tabula-java)
  2. Setting chcp 65001 (in Windows command prompt)

I understand that Tabula, tabula-java, and tabula-py use the same underlying library, but is there something different between them that would explain the difference in the encoding of the output?

  • Hi, I've just updated the link with the PDF. In your reply I see that the Japanese is being displayed; however, in the first image of my original post I don't see these characters. – Wah123 Jan 08 '23 at 07:29
  • It looks as if you are trying to interpret JIS or Shift-JIS as UTF-8. – Dragonthoughts Jan 09 '23 at 09:44

1 Answer


Background info

There is nothing unusual in this PDF compared to any other. As in any PDF, the text is written in the author's arbitrary order: for example, the first body line of the PDF (港区内認可保育園等一覧, "List of licensed nursery schools in Minato City") is the 1262nd block of text, added long after the table was started. To hear the written order we can use Read Aloud, which verifies character and language recognition, but unless the PDF is correctly tagged it will also jump from text block to text block.

[screenshot: Read Aloud following the PDF's internal text-block order]

So internally the text is rarely tabular; the first eight lines are:

1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101

Thus you need a text extractor that works in a grid-like manner, or one that converts the text layout into row-by-row output.
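The row-by-row idea can be sketched in a few lines: bucket positioned text fragments by their y-coordinate, then sort each bucket by x. This is a minimal illustration of the technique, not tabula's actual algorithm; the fragments and tolerance below are invented for the example.

```python
from collections import defaultdict

def fragments_to_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows by y, then sort each row by x.

    `fragments` is a list of (x, y, text) tuples in PDF coordinates;
    fragments whose y values fall in the same tolerance bucket are
    treated as belonging to the same visual row.
    """
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so slightly misaligned fragments share a row.
        key = round(y / y_tolerance)
        rows[key].append((x, text))
    # PDF y grows upward, so emit rows top-to-bottom (descending y).
    return [
        " ".join(t for _, t in sorted(rows[key]))
        for key in sorted(rows, reverse=True)
    ]

# Fragments given in "author order", not reading order (invented example):
frags = [(120, 700, "1歳"), (40, 700, "0歳"), (40, 680, "区立"), (200, 700, "計")]
print(fragments_to_rows(frags))  # ['0歳 1歳 計', '区立']
```

Real extractors refine this with per-line baselines and column detection, but the bucketing step above is the core of any "lattice" or layout-preserving mode.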

This is where any extractor will be confounded as to how to output such a jumbled, dense layout; generally all of them will struggle with this page.

Hence it's best to use a good generic solution. The result will still need data cleaning, but at least you will have something to work on.

If you only need one zone of the page, it is best to set the boundary of interest to avoid extraneous parsing.

Your "standalone Tabula tool" output is very good but could possibly be better by use pdftotext -layout and adjust some options to produce amore regular order.

[screenshot: pdftotext -layout output]

Your Question

the difference in encoding output?

The Answer

The output from a PDF is not in its internal coding. The desired text output is UTF-8, but a PDF does not store text as UTF-8 or Unicode; it simply uses numbers looked up in a font's character map. If the map were poor, everything would be gibberish. In this case the map is good, so where does the gibberish arise? It arises because the output stage is not using UTF-8, and console output is rarely Unicode.
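You can reproduce this kind of gibberish without any PDF at all: bytes written in one codec and read back in another are exactly what "not Japanese characters" looks like. A small demonstration (this is an illustration of the mechanism, not the asker's actual pipeline):

```python
# The first body line from the PDF, as correct Unicode text.
text = "港区内認可保育園等一覧"

# Encode it with a legacy Japanese codec, as an old console might.
shift_jis_bytes = text.encode("shift_jis")

# Interpreting those bytes as UTF-8 produces mojibake, not Japanese:
garbled = shift_jis_bytes.decode("utf-8", errors="replace")
print(garbled)  # replacement characters / gibberish

# Decoding with the codec that was actually used recovers the text:
assert shift_jis_bytes.decode("shift_jis") == text
```

The same mismatch in the other direction (UTF-8 bytes shown on a non-UTF-8 console) is the most likely source of the unreadable output in the question.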

You correctly identified that the console needs to be set to Unicode mode; once it is, the output should match (except for the density problem).
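The effect of forcing the output stage to UTF-8 (which is what chcp 65001 and -Dfile.encoding=utf8 aim for) can be shown with a plain byte stream; this sketch is independent of tabula:

```python
import io

text = "港区内認可保育園等一覧"

# A stream explicitly configured for UTF-8 round-trips the text intact.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="utf-8")
out.write(text)
out.flush()
assert buf.getvalue().decode("utf-8") == text

# A legacy single-byte codepage (cp437, the default US console codepage)
# cannot even represent the characters:
try:
    text.encode("cp437")
except UnicodeEncodeError:
    print("cp437 console cannot represent Japanese text")
```

So the extraction can be perfectly correct internally while the final print step is still the point of failure.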

[screenshot: console output in Unicode mode]

The density issue would be easier to handle if the text were preprocessed into a flowing format such as HTML [screenshot: HTML output], or extracted using a different tool [screenshot: alternative extractor output].

  • Thank you for this analysis. Further testing shows that calling tabula-java actually outputs the correct encoding (I am ignoring whether the contents are accurate for the moment), and even tabula-py outputs the correct encoding. This is aligned with your findings regarding the PDF. The issue now seems to be why tabula-py stores gibberish in the pandas DataFrame (instead of UTF-8) when using tabula.read_pdf('minato.pdf', java_options="-Dfile.encoding=UTF8", pages=1, lattice=True, output_format="dataframe", encoding="utf-8") – Wah123 Jan 11 '23 at 02:31