I have a bunch of PDF 1.4 files printed from Word with Adobe Distiller 6. The fonts are embedded (Tahoma and Times New Roman, which I also have on my Linux machine) and the encoding is reported as "ANSI" and "Identity-H". By "ANSI" I assume the regional code page of the Windows machine is used, which is CP-1251 (Cyrillic); "Identity-H" I assume is something only Adobe knows about.
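For reference, the embedded fonts and their declared encodings can also be listed on the Linux side with poppler's pdffonts (assuming a reasonably recent poppler-utils that prints the encoding column):
pdffonts sample.pdf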
I want to extract just the text and index these files. The problem is that I get garbage output from pdftotext.
I tried exporting a sample PDF to text from Acrobat, and again got garbage, but additionally running the result through iconv gave me the correct data:
iconv -f windows-1251 -t utf-8 Adobe-exported.txt
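To double-check the exported dump before converting, something along these lines confirms it is single-byte text rather than UTF-8/UTF-16 (file often reports CP1251 text only as "unknown-8bit" or similar, so a hexdump of a known word is more telling):
file -bi Adobe-exported.txt
hexdump -C Adobe-exported.txt | head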
But the same trick doesn't work with pdftotext:
pdftotext -raw -nopgbrk sample.pdf - | iconv -f windows-1251 -t utf-8
which by default outputs UTF-8, prints some garbage, after which: iconv: illegal input sequence at position 77
pdftotext -raw -nopgbrk -enc Latin1 sample.pdf - | iconv -f windows-1251 -t utf-8
also produces garbage.
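In case it matters which output encodings pdftotext knows about at all, they can be listed (if your poppler build supports this option) with:
pdftotext -listenc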
In /usr/share/poppler/unicodeMap I don't have a CP1251 map, and I couldn't find one with Google, so I tried to make my own. I built the file from the Wikipedia CP1251 table (a sketch of how such a map can be generated is at the end of this question) and appended at the end of the file what the other maps had:
...
fb00 6666
fb01 6669
fb02 666c
fb03 666669
fb04 66666c
so that pdftotext does not complain. But the output of:
pdftotext -enc CP1251 sample.pdf -
is the same garbage again. hexdump does not reveal anything at first sight, so I thought I'd ask about my problem here before desperately trying to draw conclusions from the hex dumps.
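For reference, instead of transcribing the Wikipedia table by hand, a map in the same two-column unicode-hex / output-bytes-hex layout as the ligature lines above can be generated with iconv. This is only a rough bash sketch (it assumes that layout is what poppler expects, that xxd is available, and that the result goes into /usr/share/poppler/unicodeMap/CP1251); the ligature lines would still be appended afterwards:
# for each printable CP1251 byte, emit a "<unicode-hex> <byte-hex>" line
for i in $(seq 32 126) $(seq 128 255); do
  byte=$(printf '%02x' "$i")
  # iconv prints nothing for bytes CP1251 leaves undefined (e.g. 0x98), so those get skipped
  cp=$(printf "\x$byte" | iconv -f windows-1251 -t UTF-32BE 2>/dev/null | xxd -p)
  [ -n "$cp" ] && printf '%s %s\n' "${cp#0000}" "$byte"
done | sort > CP1251   # keep entries sorted by code point, like the shipped maps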