Extracting Arabic text in C# using iTextSharp

Question

I use the code below to extract the text of a PDF. It works well for a PDF in English, but when I try to extract Arabic text, it gives me something like this:

") + n 9 n <+, + )+ $ # $ +$ F% 9& .< $ : ;"

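// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
StringBuilder text = new StringBuilder();
using (PdfReader reader = new PdfReader(path))
{
     for (int i = 1; i <= reader.NumberOfPages; i++)
     {
          // Use a fresh strategy per page; a reused instance accumulates the text of earlier pages
          ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
          text.Append(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
     }
}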

This looks like the pdf does not contain the information required for text extraction according to the pdf specification. — mkl, Nov 14 '16 at 19:43
Did you try this http://stackoverflow.com/questions/35436158/itextsharp-cant-extract-pdf-unicode-content-in-c-sharp ? — KMoussa, Nov 14 '16 at 19:50
No, there are a lot of words, but iTextSharp encodes the Arabic words. — Ahmad Tarabeshi, Nov 14 '16 at 19:53
@KMoussa It didn't solve my problem; the text still comes out encoded like " , -%& ,. &$, $/ . % -% ) $ ( +% ) & !" +/ ) $ ( 12 . 3$) ( $ 45 .( 3$) %& ,5 6 7 !8$ # & * . 3$) +8 $ +8 9 3$, -: .( 3$) . +8 ). 15 + $ $ %& $ 7" $, $ ,5 . .( 3$) ) $ ) & ( . : ,5$ 3(& . -(5$$ 2) %& $5 8$2) $/ $ " — Ahmad Tarabeshi, Nov 14 '16 at 20:05
A picture doesn't help. If you don't want to share the PDF, try doing `copy/paste` of the text in Adobe Reader? Do you get the same result as with iText? If so: you can't extract the Arabic text correctly because **the PDF doesn't contain the information required for text extraction according to the PDF specification** (which was the very first comment you got on this question). — Bruno Lowagie, Nov 15 '16 at 07:02
Here's the output: ق ديمحلا دبع : :يبابضلا قطنملا فيرعت اقيبطتو ةريبخلا ةمظنلأا ضعب يف مدختسي ،قطنملا لاكشأ دحأ وه ت ءاكذلا يعانصلا ماع قطنملا اذه أشن 5691 صلأا يناجيبرذلأا ملاعلا دي ىلع ل " and the original: " هو أحد أشكال المنطق، يستخدم في بعض الأنظمة الخبيرة وتطبيقات الذكاء الصناعي نشأ هذا المنطق عام 5691 على يد العالم الأذربيجاني الأصل "ل ." I'd be really grateful if you could help me, @BrunoLowagie — Ahmad Tarabeshi, Nov 15 '16 at 16:49
Are you using the latest iText version? Older versions weren't able to extract Arabic text correctly. — Bruno Lowagie, Nov 15 '16 at 17:03
@BrunoLowagie It's done: I reversed the string, but I want to keep the English words as they are. Do you have any idea how? — Ahmad Tarabeshi, Nov 15 '16 at 17:26
You have to check the Unicode range and only reverse the characters in the "Arabic range". — Bruno Lowagie, Nov 15 '16 at 17:47

score 5 · Answer 1 · edited Jul 11 '18 at 08:48

I had to change the extraction strategy like this:

var t = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
var te = Convert(t);

and use this function to reverse the Arabic words while leaving the English ones unchanged:

     // Requires: using System.Linq; using System.Text;
     // Reverses each run of Arabic characters and copies all other characters unchanged.
     private string Convert(string source)
     {
          string arabicWord = string.Empty;
          StringBuilder sbDestination = new StringBuilder();

          foreach (var ch in source)
          {
               if (IsArabic(ch))
                    arabicWord += ch;
               else
               {
                    // A non-Arabic character ends the current Arabic run
                    if (arabicWord != string.Empty)
                         sbDestination.Append(Reverse(arabicWord));

                    sbDestination.Append(ch);
                    arabicWord = string.Empty;
               }
          }

          // If the text ended with an Arabic run, flush it
          if (arabicWord != string.Empty)
               sbDestination.Append(Reverse(arabicWord));

          return sbDestination.ToString();
     }


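     private bool IsArabic(char character)
     {
         // Arabic block (U+0600 to U+06FF)
         if (character >= 0x600 && character <= 0x6ff)
             return true;

         // Arabic Supplement block (U+0750 to U+077F)
         if (character >= 0x750 && character <= 0x77f)
             return true;

         // First part of Arabic Presentation Forms-A (the full block runs to U+FDFF)
         if (character >= 0xfb50 && character <= 0xfc3f)
             return true;

         // Arabic Presentation Forms-B (U+FE70 to U+FEFC)
         if (character >= 0xfe70 && character <= 0xfefc)
             return true;

         return false;
     }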

     // Reverse the characters of string
     string Reverse(string source)
     {
          return new string(source.ToCharArray().Reverse().ToArray());
     }
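
As a rough illustration of what these functions do, here is a condensed, self-contained sketch of the same run-reversal idea. The class name `ArabicRunReverser` and the sample string are hypothetical and only serve the example; the character ranges are copied from `IsArabic` above. It assumes the extracted text holds each Arabic word with its characters in reverse order.

```csharp
using System;
using System.Text;

static class ArabicRunReverser // hypothetical name, for illustration only
{
    // Same ranges as the IsArabic function in the answer.
    static bool IsArabic(char c) =>
        (c >= 0x600 && c <= 0x6ff) || (c >= 0x750 && c <= 0x77f) ||
        (c >= 0xfb50 && c <= 0xfc3f) || (c >= 0xfe70 && c <= 0xfefc);

    // Reverses each maximal run of Arabic characters; all other characters are copied unchanged.
    public static string Convert(string source)
    {
        var result = new StringBuilder(source.Length);
        var run = new StringBuilder();
        foreach (char ch in source)
        {
            if (IsArabic(ch))
            {
                run.Insert(0, ch); // prepending builds the run in reverse order
            }
            else
            {
                result.Append(run).Append(ch); // flush the finished run, then the non-Arabic char
                run.Clear();
            }
        }
        return result.Append(run).ToString(); // flush a trailing Arabic run, if any
    }

    static void Main()
    {
        // Made-up sample: "\u0645\u0644\u0627\u0639" is the word "\u0639\u0627\u0644\u0645" stored backwards.
        string extracted = "PDF \u0645\u0644\u0627\u0639 2016";
        string corrected = Convert(extracted);
        Console.WriteLine(corrected == "PDF \u0639\u0627\u0644\u0645 2016"); // True
    }
}
```

Note that a space is not in any of the tested ranges, so it ends a run: each Arabic word is reversed in place, while the order of the words, the Latin text and the ASCII digits are left untouched.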

None of the characters shown in the output in your question is in the ranges tested for in `IsArabic`. So if the code from your answer really helps, you didn't present the data you actually extracted in your question... — mkl, Nov 15 '16 at 21:06
Actually, it happens for some, especially with an old version :) — Ahmad Tarabeshi, Nov 16 '16 at 19:46
OK, thank you for sharing this code. I'm sure it will be helpful for other people too. — Bruno Lowagie, Nov 17 '16 at 12:48
