Reversed Hebrew or numbers after using iText to parse a PDF document

Question

I'm working with iText5 to parse a PDF written mostly in Hebrew.
To extract the text I use PdfTextExtractor.getTextFromPage. I didn't find a way to change the encoding in the library, and the text comes out as gibberish.

I tried to fix the encoding like this:
new String(pdfPage.getBytes(Charset1), Charset2).
I went through all possible charsets using Charset.availableCharsets(), and a few of them gave me Hebrew instead of gibberish, but reversed.

Now I thought I could reverse the text line by line, but Hebrew is right to left while numbers and English are left to right. So if I reverse the line, it fixes the Hebrew but breaks the numbers/English.

Example:

PdfTextExtractor.getTextFromPage returns 87.55 úåáééçúä ééåëéð ë"äñ

new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255")) returns 87.55 תובייחתה ייוכינ כ"הס

If I reverse this, I get סה"כ ניכויי התחייבות 55.78

The number should be 87.55, not 55.78.

The only solution I found is to split it into Hebrew and the rest (English/numbers) and reverse only the Hebrew parts and then merge it back.

Isn't there an easier solution? I feel like I'm missing something with the encoding/RTL.

I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish, I found this [document](http://www.mchp.gov.il/pikuach_pnimiyati/merkazia_chinuchit/Documents/nifradnu_cach.pdf) - the **last paragraph** of the document has exactly the same problem I have in my documents. — Boaz, Aug 13 '18 at 20:55

score 1 · Answer 1 · edited Jun 20 '20 at 09:12

I can't share the document I'm working on because it contains PII. But after searching Google for PDFs with gibberish, I found this document - the last paragraph of the document has exactly the same problem I have in my documents.

I can only analyze the data given, so in this case only the linked government paper, whose last paragraph is extracted as

ìëéî ìù "íééç éøåùéë" øôñá ,äéãôåìòôäá íéáø úåðåéòø ãåò àåöîì ïúéð 
.ãåòå úéëåðéçä äééæëøîá ,567 'îò ,ïîöìæ éìéðå ì÷ðøô äéæø ,ïîæåø

And in this case the reason for the gibberish output is simple: The PDF claims that this gibberish is indeed the text there!

Thus, the problem is not the text extractor, be it the iText PdfTextExtractor, Adobe Reader copy&paste, or whichever. Instead, the problem is the document, which lies about its contents.

In more detail

The font TT1 used for this paragraph has a ToUnicode entry with the following mappings:

28 beginbfchar
<0003> <0020>
<0005> <0022>
<000a> <0027>
<000f> <002C>
<0011> <002E>
<001d> <003A>
<0069> <00E1>
<006a> <00E0>
<006b> <00E2>
<006c> <00E4>
<006d> <00E3>
<006e> <00E5>
<006f> <00E7>
<0070> <00E9>
<0071> <00E8>
<0074> <00ED>
<0075> <00EC>
<0078> <00F1>
<0079> <00F3>
<007a> <00F2>
<007b> <00F4>
<007c> <00F6>
<007e> <00FA>
<007f> <00F9>
<0096> <00E6>
<0097> <00F8>
<00ab> <00F7>
<00d5> <00F0>
endbfchar
3 beginbfrange
<0018> <001a> <0035>
<0072> <0073> <00EA>
<0076> <0077> <00EE>
endbfrange

I.e. all codes are mapped to Unicode values between U+0020 and U+00FA, a Unicode range in which the Hebrew characters one sees in the rendered document are clearly not located. More exactly: aside from space, some punctuation, and digits (which are extracted correctly), the values are in the range between U+00E0 and U+00FA, a region where Latin letters with accents and their ilk are located.

You mention that in some case you could retrieve the Hebrew text by applying

new String(text.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("windows-1255"))

So the PDF creation tool has probably put mappings to the Windows-1255 code page into the ToUnicode map. That is obviously wrong: the ToUnicode map must contain mappings to Unicode.
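As a minimal sketch of that workaround (it repairs the extracted string, not the PDF), the snippet below reinterprets each extracted character as a Windows-1255 byte. The class and method names are made up for illustration, and it assumes that every extracted character fits into ISO-8859-1 and that the JRE ships the windows-1255 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: the ToUnicode map really holds Windows-1255 code values.
    private static final Charset WINDOWS_1255 = Charset.forName("windows-1255");

    /** Reinterprets each extracted char (U+0000..U+00FF) as one Windows-1255 byte. */
    static String reinterpretAsWindows1255(String extracted) {
        // Chars above U+00FF cannot be encoded in ISO-8859-1 and would become '?'.
        byte[] raw = extracted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, WINDOWS_1255);
    }

    public static void main(String[] args) {
        // U+00E9 is byte 0xE9 in ISO-8859-1; Windows-1255 decodes 0xE9 as U+05D9 (yod).
        String fixed = reinterpretAsWindows1255("\u00E9 87.55");
        for (char c : fixed.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

Digits, spaces and ASCII punctuation pass through unchanged because both charsets agree on that range. The result is still in visual order.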

That all being said, even if the ToUnicode mappings were correct, you'd still have to fight with reversed Hebrew output. This is indeed a limitation of iText 5.x text extraction: it has no special support for RTL languages. Thus, you'll have to change the order of the characters in the result string yourself.

In this answer you'll find an example of such a re-ordering method. It is for Arabic script and it is in C# but it clearly shows how to proceed.
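For a rough idea of what such a manual re-ordering could look like in Java, here is a naive sketch: reverse the whole line, then flip runs of ASCII letters and digits back. The class name and the regular expression that decides what counts as a left-to-right run are my own assumptions, not part of iText or of the linked answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NaiveVisualToLogical {
    // Assumption: ASCII letters/digits, optionally joined by one of . , : / -
    // form a single left-to-right run (e.g. "87.55").
    private static final Pattern LTR_RUN =
            Pattern.compile("[0-9A-Za-z]+(?:[.,:/-][0-9A-Za-z]+)*");

    static String toLogical(String visualLine) {
        // Step 1: reverse the whole line; this fixes the Hebrew but flips LTR runs.
        String reversed = new StringBuilder(visualLine).reverse().toString();

        // Step 2: flip every LTR run back into reading order.
        Matcher m = LTR_RUN.matcher(reversed);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String run = new StringBuilder(m.group()).reverse().toString();
            m.appendReplacement(out, Matcher.quoteReplacement(run));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Visual-order line from the question, written with Unicode escapes.
        String visual = "87.55 \u05EA\u05D5\u05D1\u05D9\u05D9\u05D7\u05EA\u05D4 "
                + "\u05D9\u05D9\u05D5\u05DB\u05D9\u05E0 \u05DB\"\u05D4\u05E1";
        System.out.println(toLogical(visual));
    }
}
```

This only covers simple lines. Several Latin words in a row come out in the wrong word order, and brackets are not mirrored, so a real bidi implementation is the safer choice.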

The answer for the Arabic script doesn't handle punctuation correctly. — Boaz, Aug 27 '18 at 12:24
Ah, ok. Well, using ICU (as you do now) should be advanced enough for that. — mkl, Aug 27 '18 at 13:00

score 1 · Answer 2 · answered Aug 21 '18 at 09:05

First of all, the most appropriate Hebrew byte character set is "ISO-8859-8" (better than windows-1255). Try playing with it. Also, I would try to extract the String using the UTF-8 charset.

Also, there is a great diagnostic tool that has helped me diagnose and resolve countless thorny encoding issues related to Hebrew and Arabic.

There is an open source Java library, MgntUtils, with a utility that converts Strings to a Unicode sequence and vice versa:

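    import com.mgnt.utils.StringUnicodeEncoderDecoder;

    String result = "שלום את";
    // Encode every character as a Unicode escape
    result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
    System.out.println(result);
    // Decode the escapes back to the original String
    result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
    System.out.println(result);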

The output of this code is:

\u05e9\u05dc\u05d5\u05dd\u0020\u05d0\u05ea
שלום את

Here is the javadoc for the class StringUnicodeEncoderDecoder. As you can see, the Unicode code points for Hebrew are U+05**, where the first Hebrew letter (Alef, א) is U+05d0 and the last Hebrew letter (Tav, ת) is U+05ea.

The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and javadoc.

So what I would do first is take your original String, convert it to a Unicode sequence, and see what you are actually getting there. If the data is not correct, then try to extract the bytes and build a String with UTF-8. Anyway, I would strongly recommend using this utility, as it has helped me many times.
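If you just want a quick look without adding a dependency, a stdlib-only sketch of the same kind of dump could look like this. The class and method names are hypothetical and this is not the MgntUtils implementation. It prints one escape per UTF-16 code unit.

```java
final class UnicodeDump {
    /** Renders every UTF-16 code unit as a lowercase four-digit hex escape. */
    static String toUnicodeSequence(String text) {
        StringBuilder sb = new StringBuilder(text.length() * 6);
        for (int i = 0; i < text.length(); i++) {
            sb.append(String.format("\\u%04x", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly extracted Hebrew letter shows up in the 05d0..05ea range,
        // a wrongly mapped one in the 00e0..00fa range.
        System.out.println(toUnicodeSequence("\u05D0"));
        System.out.println(toUnicodeSequence("\u00E0"));
    }
}
```

If the dump of the extracted text shows values around 00e0 to 00fa where you expect Hebrew, the extractor faithfully returned what the PDF's mapping claims, and the repair has to happen afterwards.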

I played with the data given in mkl's answer, and converting it from ISO-8859-1 to ISO-8859-8 gives you reversed Hebrew. So, the data is "screwed". You will have to reverse the Hebrew String and then reverse all Latin character sequences again to get your numbers back in the correct order. Essentially, I am confirming that mkl's answer is correct. — Michael Gantman, Aug 21 '18 at 10:05

score 1 · Accepted Answer · answered Aug 27 '18 at 12:04

Using ICU did the job:

import com.ibm.icu.text.Bidi;

// input: one line of extracted text, after the charset fix
Bidi bidi = new Bidi();
bidi.setPara(input, Bidi.RTL, null);
String output = bidi.writeReordered(Bidi.DO_MIRRORING);
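For anyone who cannot add ICU4J, a similar reordering can be sketched with the JDK's own java.text.Bidi. This is a swapped-in technique, not the ICU call above: the class name is hypothetical, the reversal loop is my own reading of the Unicode Bidirectional Algorithm's reordering rule, and unlike Bidi.DO_MIRRORING it does not mirror brackets.

```java
import java.text.Bidi;

final class StdlibBidiReorder {
    /** Reorders one line with an RTL base direction, using JDK classes only. */
    static String reorder(String line) {
        if (line.isEmpty()) {
            return line;
        }
        Bidi bidi = new Bidi(line, Bidi.DIRECTION_RIGHT_TO_LEFT);
        char[] chars = line.toCharArray();
        int n = chars.length;
        int[] levels = new int[n];
        int maxLevel = 0;
        int minOddLevel = Integer.MAX_VALUE;
        for (int i = 0; i < n; i++) {
            levels[i] = bidi.getLevelAt(i);
            maxLevel = Math.max(maxLevel, levels[i]);
            if ((levels[i] & 1) == 1) {
                minOddLevel = Math.min(minOddLevel, levels[i]);
            }
        }
        // From the highest level down to the lowest odd level, reverse every
        // maximal run of characters whose level is at least the current one.
        for (int level = maxLevel; level >= minOddLevel; level--) {
            int i = 0;
            while (i < n) {
                if (levels[i] < level) {
                    i++;
                    continue;
                }
                int end = i;
                while (end < n && levels[end] >= level) {
                    end++;
                }
                reverse(chars, i, end - 1);
                i = end;
            }
        }
        return new String(chars);
    }

    // Swaps chars in place; assumes no surrogate pairs (Hebrew is in the BMP).
    private static void reverse(char[] a, int from, int to) {
        while (from < to) {
            char tmp = a[from];
            a[from++] = a[to];
            a[to--] = tmp;
        }
    }

    public static void main(String[] args) {
        String visual = "87.55 \u05EA\u05D5\u05D1\u05D9\u05D9\u05D7\u05EA\u05D4 "
                + "\u05D9\u05D9\u05D5\u05DB\u05D9\u05E0 \u05DB\"\u05D4\u05E1";
        System.out.println(reorder(visual));
    }
}
```

On the example line from the question, the digits get an even embedding level and the rest an odd one, so the number is flipped twice and ends up readable while the Hebrew is flipped once.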


Boaz
