I have a bunch of PDF (1.4) files printed from Word with Adobe Distiller 6. The fonts are embedded (Tahoma and Times New Roman, both of which I have on my Linux machine), and the encoding is listed as "ANSI" and "Identity-H". By "ANSI" I assume the regional code page of the Windows machine was used, which is CP-1251 (Cyrillic); as for "Identity-H", I assume that's something only Adobe knows about.

I want to extract only the text and index these files. The problem is that I get garbage output from pdftotext. I tried exporting a sample PDF file to text from Acrobat and again got garbage, but additionally processing it with iconv gave me the right data:

iconv -f windows-1251 -t utf-8 Adobe-exported.txt

But the same trick doesn't work with pdftotext:

pdftotext -raw -nopgbrk sample.pdf - | iconv -f windows-1251 -t utf-8

which by default assumes UTF-8 encoding, and outputs some garbage, after which: Сiconv: illegal input sequence at position 77
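A minimal sketch of one likely cause of the iconv error above, assuming glibc's iconv, where byte 0x98 is unassigned in windows-1251: UTF-8 output regularly contains 0x98 as a continuation byte (for example U+0418 "И" is D0 98), so decoding UTF-8 as windows-1251 can abort partway through. The function name `check_double_decode` is made up for illustration.

```shell
#!/bin/sh
# Feed the UTF-8 bytes of U+0418 (D0 98, written as octal escapes) to iconv
# while claiming they are windows-1251, as the failing pipeline does.
# Assumption: the installed iconv treats 0x98 as unassigned in windows-1251.
check_double_decode() {
  if printf '\320\230' | iconv -f windows-1251 -t utf-8 >/dev/null 2>&1; then
    echo "converted"
  else
    echo "illegal input sequence"
  fi
}

check_double_decode
```

If this reports a failure, the error message says more about the mismatch between the input and the `-f` encoding than about the PDF itself.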

pdftotext -raw -nopgbrk -enc Latin1 sample.pdf - | iconv -f windows-1251 -t utf-8

produces garbage again.
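For reference, here is a sketch of the case where this kind of re-decoding does work, under the assumption that the extracted text consists of Latin-1 code points (U+00C0 to U+00FF) that are really CP-1251 bytes. The helper name `fix_mojibake` and the test string are hypothetical; the input is the UTF-8 mojibake "Ïðèâåò" written as octal escapes, which should come back as "Привет".

```shell
#!/bin/sh
# Step 1: UTF-8 -> ISO-8859-1 turns each Latin-1 character back into one byte.
# Step 2: reinterpret those bytes as windows-1251 and emit UTF-8.
# Assumption: every character in the input fits in ISO-8859-1; anything else
# makes the first iconv fail.
fix_mojibake() {
  iconv -f UTF-8 -t ISO-8859-1 | iconv -f WINDOWS-1251 -t UTF-8
}

# "Ïðèâåò" as UTF-8 bytes (C3 8F C3 B0 C3 A8 C3 A2 C3 A5 C3 B2)
printf '\303\217\303\260\303\250\303\242\303\245\303\262' | fix_mojibake
echo
```

If the same two-step conversion applied to the real pdftotext output still yields garbage, the text is probably not plain CP-1251 bytes mapped to Latin-1, and the problem lies in how the character codes in the PDF map to Unicode.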

There is no CP1251 map in /usr/share/poppler/unicodeMap, and I couldn't find one with Google, so I tried to make one. I created the file from Wikipedia's CP1251 data and appended to the end of the file what the other maps had:

...
fb00 6666
fb01 6669
fb02 666c
fb03 666669
fb04 66666c

so that pdftotext does not complain, but the result from:

pdftotext -enc CP1251 sample.pdf -

is the same garbage again. hexdump does not reveal anything at first sight, so I thought I'd ask about my problem here before desperately trying to conclude something from these hexdumps.
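Instead of copying the table from Wikipedia by hand, the upper half of such a map could be generated with iconv. This is only a sketch: it assumes the map format is two hex columns, "unicode output-bytes", as in the ligature lines quoted above, and the function name `gen_cp1251_map` is made up. It covers bytes 0x80 to 0xFF only, so the ASCII range and the ligature entries would still have to be added the way the other maps do it.

```shell
#!/bin/sh
# Print one "unicode byte" line (lowercase hex) for every assigned
# CP1251 byte in the range 0x80-0xFF.
gen_cp1251_map() {
  b=128
  while [ "$b" -le 255 ]; do
    hex=$(printf '%02x' "$b")
    # Emit the raw byte via an octal escape, convert it to UTF-16BE and
    # hex-dump the result; unassigned bytes make iconv fail and leave u empty.
    u=$(printf "\\$(printf '%03o' "$b")" \
          | iconv -f CP1251 -t UTF-16BE 2>/dev/null \
          | od -An -tx1 | tr -d ' \n')
    if [ -n "$u" ]; then
      printf '%s %s\n' "$u" "$hex"
    fi
    b=$((b + 1))
  done
}

gen_cp1251_map
```

Note that, as far as I understand it, the `-enc` option only selects the output encoding. If the garbage comes from how the fonts in the PDF map character codes to Unicode, a custom output map is unlikely to change the result.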

sashoalm
theta
  • Many programs that work with PDF don't support non-Latin letters. `iconv` may not be enough in such a case; you may need something like `utf8_encode`. – kirilloid Mar 25 '12 at 04:15
  • I don't know of a `pdftotext` alternative for this kind of job. Opening all the documents one by one in Acrobat and then exporting them to text would be tiresome, and it is embarrassing for a human to do a repetitive task that is, by its nature, meant to be done by a computer. – theta Mar 25 '12 at 04:32

0 Answers