
I am on Windows 7 (32-bit). When I parse a PDF containing Russian text, I receive a results file with ??? instead of the Russian characters. The developer addresses this issue with the following fix:

I got ? character with result on Windows. How can I avoid it? If the encoding of PDF is UTF-8, you should set chcp 65001 on your terminal before launching a Python process.

chcp 65001

I changed this in the Windows cmd console, but with no result.

My code:

import tabula


tabula.convert_into(
    r"C:\Code\Active\kartoteka\misc\ExampleExtract.pdf",
    r"C:\Code\Active\kartoteka\misc\output.csv",
    output_format="csv",
    pages="all",
    java_options="-Dfile.encoding=utl-8",
)

Error log:

?? 10, 2018 11:15:18 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Can't read the embedded font Times-Roman
??? 10, 2018 11:15:18 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Using font Times New Roman instead
??? 10, 2018 11:15:19 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Can't read the embedded font Times-Roman
??? 10, 2018 11:15:19 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Using font Times New Roman instead

My resulting file still shows all Russian characters as ?????. How can I fix this?

That's how the original PDF looks: [image]

Billy Jhon
  • Is that `java_options` value correct, or is it a typo? It should be `java_options="-Dfile.encoding=UTF8"`. See also: https://stackoverflow.com/questions/6031877/jvm-property-dfile-encoding-utf8-or-utf-8 – chezou Aug 25 '18 at 13:16
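To illustrate the comment above, here is a minimal sketch of the call with the encoding name spelled as the comment suggests. The helper names `java_encoding_option` and `convert` are hypothetical, and the `codecs.lookup` check is only an assumption-based guard against misspellings: it consults Python's codec registry, not the JVM's charset list, so it catches a typo such as `utl-8` but does not prove that Java accepts the name. Whether this alone fixes the ????? output depends on the PDF.

```python
import codecs


def java_encoding_option(encoding="UTF8"):
    """Build a -Dfile.encoding JVM flag (hypothetical helper).

    Python's codec registry is used purely as a typo check; the JVM
    keeps its own list of charset names.
    """
    codecs.lookup(encoding)  # raises LookupError for a misspelling such as "utl-8"
    return "-Dfile.encoding=" + encoding


def convert(pdf_path, csv_path):
    """Run tabula-py's convert_into with the encoding flag set."""
    import tabula  # tabula-py; imported here so the helper above works without it

    tabula.convert_into(
        pdf_path,
        csv_path,
        output_format="csv",
        pages="all",
        java_options=java_encoding_option(),
    )
```

With the paths from the question, the call would be `convert(r"C:\Code\Active\kartoteka\misc\ExampleExtract.pdf", r"C:\Code\Active\kartoteka\misc\output.csv")`.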

1 Answer


Nota bene: my comment concerns the ability to extract text properly from a PDF in general rather than tabula-py specifically, but hopefully it helps you determine whether the problem lies with your PDF or with your PDF software.

It's difficult to comment on the file without seeing it, but a good starting point is to open it in Acrobat: either copying the text and pasting it into a text editor, or searching for the text content, will reveal whether the text can be extracted correctly.

If it can't be extracted properly, there's a good chance the font lacks a ToUnicode entry (see Section 9.10.1 of the ISO 32000-1:2008 PDF specification for more information).

If Acrobat can extract the text properly then it's possible there is an issue with the PDF software you are using.
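If Acrobat is not at hand, the ToUnicode check can also be approximated in code. The following is a rough sketch, assuming the third-party pypdf library is installed (it is not mentioned in the question); the function names are hypothetical. It only looks at font resources declared directly on each page, so fonts inherited from a parent page node or used inside form XObjects would be missed.

```python
def fonts_missing_tounicode(fonts):
    """Return the names of fonts whose dictionary has no /ToUnicode entry.

    `fonts` maps a font resource name to a dict-like font dictionary.
    """
    return sorted(name for name, font in fonts.items() if "/ToUnicode" not in font)


def check_pdf(path):
    """Print, for each page, the fonts that lack a /ToUnicode entry."""
    from pypdf import PdfReader  # assumption: pypdf is installed

    reader = PdfReader(path)
    for number, page in enumerate(reader.pages, start=1):
        # Only resources stored directly on the page are inspected.
        if "/Resources" not in page:
            continue
        resources = page["/Resources"]
        if "/Font" not in resources:
            continue
        font_table = resources["/Font"]
        fonts = {name: font_table[name] for name in font_table}
        print(number, fonts_missing_tounicode(fonts))
```

A font listed by `check_pdf` is a candidate for the garbled output, though a missing ToUnicode entry does not always prevent extraction, since some software can fall back on the font's encoding.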

JosephA