I'm working with iText 5 to parse a PDF written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I couldn't find a way to change the encoding in the library, and the text comes out as gibberish.

I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all available charsets using Charset.availableCharsets(), and a few of them gave me Hebrew instead of gibberish, but reversed.

I then thought I could reverse the text line by line, but Hebrew is written right to left while numbers and English are written left to right. So if I reverse the line, it fixes the Hebrew but breaks the numbers/English.

Example:

PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ

new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס

If I reverse this, I get סה"כ ניכויי התחייבות 55.78

The number should be 87.55 and not 55.78.

The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.
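
A rough sketch of that workaround, for reference. The class and method names are made up for illustration, and it assumes that only ASCII letters and digits (plus a '.' or ',' between them) form left-to-right runs. It reverses the whole line and then reverses each such run back, which has the same effect as reversing only the Hebrew parts. The Hebrew text is written with Unicode escapes so the source stays ASCII-only.

```java
class VisualToLogical {
    // Assumption: only ASCII letters and digits count as left-to-right characters.
    static boolean isLtr(char c) {
        return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Reverses the whole line, then reverses every left-to-right run back.
    static String fixLine(String visual) {
        String reversed = new StringBuilder(visual).reverse().toString();
        StringBuilder out = new StringBuilder(reversed.length());
        int i = 0;
        while (i < reversed.length()) {
            if (!isLtr(reversed.charAt(i))) {
                out.append(reversed.charAt(i++));
                continue;
            }
            // Extend the run over letters, digits, and a '.' or ',' that sits between them.
            int j = i + 1;
            while (j < reversed.length()) {
                char c = reversed.charAt(j);
                boolean joiner = (c == '.' || c == ',')
                        && j + 1 < reversed.length() && isLtr(reversed.charAt(j + 1));
                if (isLtr(c) || joiner) {
                    j++;
                } else {
                    break;
                }
            }
            out.append(new StringBuilder(reversed.substring(i, j)).reverse());
            i = j;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The re-decoded line from the example above: the number, then the reversed Hebrew words.
        String visual = "87.55 \u05EA\u05D5\u05D1\u05D9\u05D9\u05D7\u05EA\u05D4 "
                + "\u05D9\u05D9\u05D5\u05DB\u05D9\u05E0 \u05DB\"\u05D4\u05E1";
        System.out.println(fixLine(visual));
    }
}
```

This handles the example line, but it is not a full bidirectional algorithm: brackets, mixed punctuation, and non-ASCII left-to-right text would need more care.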

Isn't there an easier solution? I feel like I'm missing something about the encoding or RTL handling.

Boaz
  • Please share a sample PDF that illustrates the issue. – mkl Aug 13 '18 at 15:37
  • I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish, I found this [document](http://www.mchp.gov.il/pikuach_pnimiyati/merkazia_chinuchit/Documents/nifradnu_cach.pdf) - the **last paragraph** of the document has exactly the same problem I have in my documents. – Boaz Aug 13 '18 at 20:55

3 Answers

I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish, I found this document - the last paragraph of the document has exactly the same problem I have in my documents.

I can only analyze the data given, so in this case only the linked government paper, from which

[screenshot of the last paragraph of the linked document]

is extracted as

ìëéî ìù "íééç éøåùéë" øôñá ,äéãôåìòôäá íéáø úåðåéòø ãåò àåöîì ïúéð 
.ãåòå úéëåðéçä äééæëøîá ,567 'îò ,ïîöìæ éìéðå ì÷ðøô äéæø ,ïîæåø

And in this case the reason for the gibberish output is simple: The PDF claims that this gibberish is indeed the text there!

Thus, the problem is not the text extractor, be it the iText PdfTextExtractor, Adobe Reader copy & paste, or any other. Instead, the problem is the document, which lies about its contents.

In more detail

The font TT1 used for this paragraph has a ToUnicode entry with the following mappings:

28 beginbfchar
<0003> <0020>
<0005> <0022>
<000a> <0027>
<000f> <002C>
<0011> <002E>
<001d> <003A>
<0069> <00E1>
<006a> <00E0>
<006b> <00E2>
<006c> <00E4>
<006d> <00E3>
<006e> <00E5>
<006f> <00E7>
<0070> <00E9>
<0071> <00E8>
<0074> <00ED>
<0075> <00EC>
<0078> <00F1>
<0079> <00F3>
<007a> <00F2>
<007b> <00F4>
<007c> <00F6>
<007e> <00FA>
<007f> <00F9>
<0096> <00E6>
<0097> <00F8>
<00ab> <00F7>
<00d5> <00F0>
endbfchar
3 beginbfrange
<0018> <001a> <0035>
<0072> <0073> <00EA>
<0076> <0077> <00EE>
endbfrange 

I.e. all codes are mapped to Unicode values between U+0020 and U+00FA, a range that clearly does not contain the Hebrew characters one sees in the screenshot. More precisely: aside from space, some punctuation, and digits (which are extracted correctly), the values lie between U+00E0 and U+00FA, a region where accented Latin letters and their ilk are located.
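
To make the bfrange entries above concrete: an entry `<lo> <hi> <dst>` maps code lo+i to Unicode value dst+i. The following minimal sketch (class and method names are made up for illustration) expands the three ranges from this font. The first one yields U+0035 to U+0037, i.e. the digits 5, 6 and 7, while the other two land in the accented Latin region.

```java
import java.util.Map;
import java.util.TreeMap;

class BfRangeDemo {
    // Expands one bfrange entry: code lo+i maps to Unicode value dst+i.
    static Map<Integer, Integer> expand(int lo, int hi, int dst) {
        Map<Integer, Integer> mapping = new TreeMap<>();
        for (int code = lo; code <= hi; code++) {
            mapping.put(code, dst + (code - lo));
        }
        return mapping;
    }

    public static void main(String[] args) {
        // The three bfrange entries of the ToUnicode map shown above.
        int[][] ranges = { {0x18, 0x1a, 0x35}, {0x72, 0x73, 0xEA}, {0x76, 0x77, 0xEE} };
        for (int[] r : ranges) {
            expand(r[0], r[1], r[2]).forEach((code, value) ->
                    System.out.printf("<%04x> -> U+%04X%n", code, value));
        }
    }
}
```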

You mention that in some case you could retrieve the Hebrew text by applying

new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255"))

So the PDF creation tool has probably put mappings to the Windows-1255 code page into the ToUnicode map. That is clearly wrong: the ToUnicode map must contain mappings to Unicode.
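
Why that re-decoding works can be shown without any charset lookup. Windows-1255 places the letters alef to tav at bytes 0xE0 to 0xFA, in the same order as U+05D0 to U+05EA, so for the letters the repair is a fixed offset. The sketch below assumes exactly that and nothing more: the class name is made up, and vowel points and other Windows-1255 specifics are not handled. The strings are written with Unicode escapes to keep the source ASCII-only.

```java
class Latin1ToHebrew {
    // Offset between U+00E0 (byte 0xE0 read as Latin-1) and U+05D0 (alef).
    private static final int OFFSET = 0x05D0 - 0x00E0;

    // Shifts characters in U+00E0..U+00FA into the Hebrew block; leaves everything else alone.
    static String remap(String extracted) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            out.append(c >= 0x00E0 && c <= 0x00FA ? (char) (c + OFFSET) : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Last word of the extracted example line from the question, still in reversed order.
        String extracted = "\u00EB\"\u00E4\u00F1";
        System.out.println(remap(extracted));
    }
}
```

The result is still in reversed (visual) order; this only repairs the code points.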


That all being said, even if the ToUnicode mappings were correct, you'd still have to fight with reversed Hebrew output. This is indeed a limitation of iText 5.x text extraction: it has no special support for RTL languages. Thus, you'll have to change the order of the characters in the result string yourself.

In this answer you'll find an example of such a re-ordering method. It is for Arabic script and it is in C# but it clearly shows how to proceed.

mkl

First of all, the most appropriate Hebrew byte character set is "ISO-8859-8" (better than windows-1255). Try to play with this. Also, I would try to extract the String using the UTF-8 charset.

Also, there is a great diagnostic tool that has helped me diagnose and resolve countless thorny encoding issues related to Hebrew and Arabic.

There is an open-source Java library, MgntUtils, that has a utility that converts Strings to Unicode sequences and vice versa:
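    import com.mgnt.utils.StringUnicodeEncoderDecoder;

    // Encode a Hebrew string as a Unicode escape sequence, then decode it back.
    String result = "שלום את";
    result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
    System.out.println(result);
    result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
    System.out.println(result);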

The output of this code is:

\u05e9\u05dc\u05d5\u05dd\u0020\u05d0\u05ea
שלום את

Here is the Javadoc for the class StringUnicodeEncoderDecoder. As you can see, the Unicode code points for Hebrew are U+05**, where the first Hebrew letter (Alef, א) is U+05d0 and the last Hebrew letter (Tav, ת) is U+05ea.

The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.

So what I would do first is take your original String, convert it to a Unicode sequence, and see what you are actually getting there. If the data is not correct, then try to extract the bytes and build a String with UTF-8. In any case, I would strongly recommend this utility, as it has helped me many times.
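
If adding a dependency is not an option, the same kind of check can be sketched with the JDK alone. This is not the MgntUtils implementation, just a minimal stand-in with made-up names: it dumps each char as a four-digit hex escape and reports whether anything falls into the Hebrew block.

```java
class CodePointDump {
    // Renders every UTF-16 char of the string as a backslash-u escape with four hex digits.
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // True if at least one char belongs to the Unicode Hebrew block (U+0590..U+05FF).
    static boolean hasHebrew(String s) {
        return s.chars().anyMatch(c -> Character.UnicodeBlock.of(c) == Character.UnicodeBlock.HEBREW);
    }

    public static void main(String[] args) {
        String extracted = "\u00EB\"\u00E4\u00F1"; // a word as returned by the text extractor
        System.out.println(toEscapes(extracted));
        System.out.println("contains Hebrew: " + hasHebrew(extracted));
    }
}
```

For text extracted from the PDF discussed here, the dump shows values around 00e0 to 00fa rather than 05d0 to 05ea, which is the sign that the code points themselves are wrong and not just the display.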
Michael Gantman
  • I played with the data given in mkl's answer. Converting the data from ISO-8859-1 to ISO-8859-8 gives you reversed Hebrew. So the data is "screwed". You will have to reverse the Hebrew String and then reverse all Latin character sequences again to get your numbers back in the correct order. Essentially, I am confirming that mkl's answer is correct. – Michael Gantman Aug 21 '18 at 10:05

Using ICU did the job:

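import com.ibm.icu.text.Bidi;

// Run the bidi algorithm with RTL as the paragraph direction, then write the reordered text.
Bidi bidi = new Bidi();
bidi.setPara(input, Bidi.RTL, null);
String output = bidi.writeReordered(Bidi.DO_MIRRORING);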
Boaz