I am on Windows 7 32-bit. When I parse a PDF containing Russian text, I receive a results file with ??? instead of the Russian characters. The developer addresses this issue with this fix:
I got ? character with result on Windows. How can I avoid it? If the encoding of PDF is UTF-8, you should set chcp 65001 on your terminal before launching a Python process.
chcp 65001
I changed this in the Windows cmd, but with no result.
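One way to confirm whether the chcp change actually reached the Python process is to print the encodings Python reports. This is only a diagnostic sketch using the standard library; the helper name report_encodings is made up for illustration, and it says nothing about what the Java process started by tabula will use.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings the current Python process reports (hypothetical helper)."""
    return {
        # Encoding used when printing to the console (affected by chcp on Windows).
        "stdout": sys.stdout.encoding,
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
        # Default encoding for text files opened without an explicit encoding.
        "preferred": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    for name, value in report_encodings().items():
        print(f"{name}: {value}")
```

If the stdout entry is not a UTF-8 code page after running chcp 65001 in the same cmd window, the setting probably did not apply to the session that launched Python.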
My code:
import tabula
tabula.convert_into(r"C:\Code\Active\kartoteka\misc\ExampleExtract.pdf", r"C:\Code\Active\kartoteka\misc\output.csv", output_format="csv", pages="all", java_options="-Dfile.encoding=UTF-8")
Error log:
?? 10, 2018 11:15:18 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Can't read the embedded font Times-Roman
??? 10, 2018 11:15:18 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Using font Times New Roman instead
??? 10, 2018 11:15:19 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Can't read the embedded font Times-Roman
??? 10, 2018 11:15:19 PM org.apache.pdfbox.pdmodel.font.PDCIDFontType2Font getawtFont
INFO: Using font Times New Roman instead
My resulting file still shows all Russian characters as ?????. How can I fix this issue?
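To narrow down where the characters are lost, it may help to check whether the output file itself contains literal question marks or whether it holds Cyrillic text that the viewer displays incorrectly. The following is a rough sketch using only the standard library; inspect_text_file is a hypothetical helper, and the cp1251 fallback is an assumption about which legacy encoding a Russian Windows system would use.

```python
def inspect_text_file(path):
    """Return (decodes_as_utf8, cyrillic_count, question_mark_count) for a file."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        text = raw.decode("utf-8")
        decodes_as_utf8 = True
    except UnicodeDecodeError:
        # Assumption: fall back to the Windows Cyrillic code page just to count characters.
        text = raw.decode("cp1251", errors="replace")
        decodes_as_utf8 = False
    # Count characters in the basic Cyrillic Unicode block (U+0400 to U+04FF).
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")
    # Literal '?' characters suggest the text was already lost when the file was written.
    question_marks = text.count("?")
    return decodes_as_utf8, cyrillic, question_marks
```

Calling inspect_text_file on output.csv and getting zero Cyrillic characters with many question marks would suggest the replacement happened when the file was written, not in the program used to open it. A high Cyrillic count would point at the viewer's encoding setting instead.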