I'm using Xpdf's pdftotext on Ubuntu to extract the text from some Hebrew PDF files.

On my local machine this worked fine. When I tried the same thing on another machine, the Hebrew characters did not show up in the text file. I verified that the Hebrew language package is installed (the output below shows why I think so). Where else can I look for the problem?

>> tail -2 /etc/xpdf/xpdfrc
include /etc/xpdf/includes

>> cat /etc/xpdf/includes
# This file was automatically generated by /usr/sbin/update-xpdfrc.
# Instead, add or remove files in /etc/xpdf/ then run
# /usr/sbin/update-xpdfrc to regenerate this file.
include /etc/xpdf/xpdfrc-latin2
include /etc/xpdf/xpdfrc-thai
include /etc/xpdf/xpdfrc-greek
include /etc/xpdf/xpdfrc-turkish
include /etc/xpdf/xpdfrc-arabic
include /etc/xpdf/xpdfrc-hebrew
include /etc/xpdf/xpdfrc-cyrillic

>> cat /etc/xpdf/xpdfrc-hebrew
#----- begin Hebrew support package (2003-feb-16)
unicodeMap  ISO-8859-8  /usr/share/xpdf/hebrew/ISO-8859-8.unicodeMap
unicodeMap  Windows-1255    /usr/share/xpdf/hebrew/Windows-1255.unicodeMap
#----- end Hebrew support package

>> ls /usr/share/xpdf/hebrew/
ISO-8859-8.unicodeMap  Windows-1255.unicodeMap
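
The manual checks above could be scripted roughly as follows. This is only a sketch: the function name check_unicode_maps is made up for illustration, and it assumes each relevant line has the form "unicodeMap <name> <path>", as in the xpdfrc-hebrew listing shown.

```shell
# Print every unicodeMap path in an xpdfrc fragment that is missing or unreadable.
# Assumed line format (as in the listing above): unicodeMap <encoding-name> <path>
check_unicode_maps() {
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while read -r keyword name path || [ -n "$keyword" ]; do
        # Skip comments and any directive other than unicodeMap.
        [ "$keyword" = "unicodeMap" ] || continue
        # Report map files that do not exist or cannot be read.
        [ -r "$path" ] || printf 'missing: %s (%s)\n' "$path" "$name"
    done < "$1"
}

# Usage with the file from this question:
#   check_unicode_maps /etc/xpdf/xpdfrc-hebrew
```

No output means every map file named in the fragment is readable, which matches what the ls output shows here.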
Ofri Raviv

2 Answers

Luckily, the friendly Ubuntu people made it easy to install languages. Simply enter this command into your shell:

sudo apt-get install language-support-he language-pack-he

You will notice that it adds Hebrew support to quite a few other subsystems (such as HSpell, Myspell, and PostgreSQL) and installs some Hebrew fonts as well.

For good measure, install the following Hebrew fonts:

sudo apt-get install culmus culmus-fancy xfonts-efont-unicode xfonts-efont-unicode-ib xfonts-intl-european msttcorefonts

Finally, when you run pdftotext, make sure you specify UTF-8 explicitly. The -enc option sets the encoding of the text output, and the default on a given machine may not be able to represent Hebrew characters:

pdftotext -enc UTF-8 input.pdf output.txt
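
To confirm that the Hebrew text actually made it into the output, a rough check like the one below may help. It is a sketch, not part of pdftotext: count_hebrew_lines is a hypothetical helper that assumes the output file is UTF-8 and counts lines containing bytes in the ranges that encode the Unicode Hebrew block (U+0590 to U+05FF).

```shell
# Count lines in a UTF-8 text file that contain at least one Hebrew character.
# In UTF-8, U+0590..U+05FF is encoded as 0xD6 0x90-0xBF or 0xD7 0x80-0xBF.
count_hebrew_lines() {
    # Build the byte pattern from octal escapes so this script stays ASCII-only.
    pattern=$(printf '\326[\220-\277]|\327[\200-\277]')
    # LC_ALL=C makes grep compare raw bytes; -a forces text mode; -c prints the count.
    LC_ALL=C grep -a -c -E "$pattern" "$1"
}

# Usage (output.txt is the placeholder name from the command above):
#   pdftotext -enc UTF-8 input.pdf output.txt
#   count_hebrew_lines output.txt
```

A count of 0 on the failing machine, compared with a non-zero count on the working one, would suggest that the characters are lost during extraction rather than when the file is displayed.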

You should have a look at TET, the Text Extraction Toolkit by PDFlib.com (run by Thomas Merz, author of "PostScript and PDF Bible").

TET is mainly a library for use within other PDF processing applications, but they've also...

  • ...built a powerful command-line tool on top of it, called 'TET iFilter' (free as in beer);
  • ...built an Acrobat plugin (free as in beer).

It can extract non-ASCII text from PDFs (including CJK, Hebrew, and Arabic) and restore ligature glyphs to their original character pairs or triples. In general, it runs circles around Adobe's own text extraction capabilities.

It's available for Windows, Linux, Mac OS X and various Unix systems.

Kurt Pfeifle