I'm using this code to extract the text of a PDF. It works well for a PDF in English, but when I try to extract Arabic text, the output looks like this:

") + n 9 n <+, + )+ $ # $ +$ F% 9& .< $ : ;"

using (PdfReader reader = new PdfReader(path))
{
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    string text = "";
    for (int i = 1; i <= reader.NumberOfPages; i++)
    {
        text = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    }
}
Joris Schellekens
  • This looks like the pdf does not contain the information required for text extraction according to the pdf specification. – mkl Nov 14 '16 at 19:43
  • Did you try this http://stackoverflow.com/questions/35436158/itextsharp-cant-extract-pdf-unicode-content-in-c-sharp ? – KMoussa Nov 14 '16 at 19:50
  • No, there are a lot of words, but iTextSharp encodes the Arabic words. – Ahmad Tarabeshi Nov 14 '16 at 19:53
  • @KMoussa it didn't solve my problem, it is still encoding it like " , -%& ,. &$, $/ . % -% ) $ ( +% ) & !" +/ ) $ ( 12 . 3$) ( $ 45 .( 3$) %& ,5 6 7 !8$ # & * . 3$) +8 $ +8 9 3$, -: .( 3$) . +8 ). 15 + $ $ %& $ 7" $, $ ,5 . .( 3$) ) $ ) & ( . : ,5$ 3(& . -(5$$ 2) %& $5 8$2) $/ $ " – Ahmad Tarabeshi Nov 14 '16 at 20:05
  • Please share a sample file to reproduce the issue. – mkl Nov 14 '16 at 20:26
  • @mkl I added a picture showing the output after extraction. – Ahmad Tarabeshi Nov 14 '16 at 20:44
  • Obviously we cannot reproduce the issue using a picture... – mkl Nov 15 '16 at 02:03
  • A picture doesn't help. If you don't want to share the PDF, try doing `copy/paste` of the text in Adobe Reader? Do you get the same result as with iText? If so: you can't extract the Arabic text correctly because **the PDF doesn't contain the information required for text extraction according to the PDF specification** (which was the very first comment you got on this question). – Bruno Lowagie Nov 15 '16 at 07:02
  • Here's the output: ق ديمحلا دبع : :يبابضلا قطنملا فيرعت اقيبطتو ةريبخلا ةمظنلأا ضعب يف مدختسي ،قطنملا لاكشأ دحأ وه ت ءاكذلا يعانصلا ماع قطنملا اذه أشن 5691 صلأا يناجيبرذلأا ملاعلا دي ىلع ل " and the original: " هو أحد أشكال المنطق، يستخدم في بعض الأنظمة الخبيرة وتطبيقات الذكاء الصناعي نشأ هذا المنطق عام 5691 على يد العالم الأذربيجاني الأصل "ل ." I'd be really grateful if you could help me @BrunoLowagie – Ahmad Tarabeshi Nov 15 '16 at 16:49
  • Are you using the latest iText version? Older versions weren't able to extract Arabic text correctly. – Bruno Lowagie Nov 15 '16 at 17:03
  • I'm using version 5.5.8.0. – Ahmad Tarabeshi Nov 15 '16 at 17:07
  • @BrunoLowagie it's done, I reversed the string, but I want to keep the English words as they are. Do you have any idea? – Ahmad Tarabeshi Nov 15 '16 at 17:26
  • You have to check the Unicode range and only reverse the characters in the "Arabic range". – Bruno Lowagie Nov 15 '16 at 17:47
  • @BrunoLowagie thanks sir, it's done. – Ahmad Tarabeshi Nov 15 '16 at 20:45
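
The comments above describe two different failures: output with no Arabic letters at all (the symbol strings), and Arabic letters that come out in reversed order. A rough sketch for telling the two apart is below. `ExtractionCheck` and `ArabicRatio` are hypothetical names, not part of iTextSharp, and the check assumes it is enough to look at the basic Arabic block (U+0600 to U+06FF).

```csharp
using System;
using System.Linq;

// Hypothetical helper, not an iTextSharp API.
public static class ExtractionCheck
{
    // Fraction of non-whitespace characters that fall in the basic
    // Arabic block (U+0600 to U+06FF). Returns 0 for empty or blank input.
    public static double ArabicRatio(string text)
    {
        int total = text.Count(c => !char.IsWhiteSpace(c));
        if (total == 0)
            return 0.0;

        int arabic = text.Count(c => c >= 0x0600 && c <= 0x06FF);
        return (double)arabic / total;
    }
}
```

A ratio of 0 for a page that visibly contains Arabic suggests the first case, where reversing characters will not help. A high ratio with unreadable words suggests the second case, where reordering is worth trying.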

1 Answer

I had to change the strategy like this:

var t = PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy());
var te = Convert(t);

and use this function to reverse the Arabic words while leaving the English text as is:

// Requires: using System.Linq; using System.Text;
private string Convert(string source)
{
    string arabicWord = string.Empty;
    StringBuilder sbDestination = new StringBuilder();

    foreach (var ch in source)
    {
        if (IsArabic(ch))
        {
            // Collect consecutive Arabic characters into one run.
            arabicWord += ch;
        }
        else
        {
            // A non-Arabic character ends the run: emit the run reversed,
            // then the character itself unchanged.
            if (arabicWord != string.Empty)
                sbDestination.Append(Reverse(arabicWord));

            sbDestination.Append(ch);
            arabicWord = string.Empty;
        }
    }

    // If the text ended with an Arabic run, flush it.
    if (arabicWord != string.Empty)
        sbDestination.Append(Reverse(arabicWord));

    return sbDestination.ToString();
}

private bool IsArabic(char character)
{
    // Arabic (U+0600 to U+06FF)
    if (character >= 0x600 && character <= 0x6ff)
        return true;

    // Arabic Supplement (U+0750 to U+077F)
    if (character >= 0x750 && character <= 0x77f)
        return true;

    // Part of Arabic Presentation Forms-A (U+FB50 to U+FC3F)
    if (character >= 0xfb50 && character <= 0xfc3f)
        return true;

    // Arabic Presentation Forms-B (U+FE70 to U+FEFC)
    if (character >= 0xfe70 && character <= 0xfefc)
        return true;

    return false;
}

// Reverse the characters of a string.
private string Reverse(string source)
{
    return new string(source.ToCharArray().Reverse().ToArray());
}
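
To try the reversal without a PDF, here is a small self-contained sketch of the same idea. `ArabicRunReverser`, `FixVisualOrder` and `Demo` are hypothetical names chosen for illustration, and the sample input is an assumption: the word "كتاب" written with its characters in reversed order, followed by an English word. The Arabic ranges are the same four used in `IsArabic` above.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of iTextSharp.
public static class ArabicRunReverser
{
    // Same four ranges as IsArabic in the answer.
    public static bool IsArabic(char c)
    {
        return (c >= 0x0600 && c <= 0x06FF)
            || (c >= 0x0750 && c <= 0x077F)
            || (c >= 0xFB50 && c <= 0xFC3F)
            || (c >= 0xFE70 && c <= 0xFEFC);
    }

    // Reverses each run of consecutive Arabic characters and copies
    // every other character through unchanged.
    public static string FixVisualOrder(string source)
    {
        var result = new StringBuilder(source.Length);
        var run = new StringBuilder();

        foreach (char ch in source)
        {
            if (IsArabic(ch))
            {
                run.Append(ch);
                continue;
            }

            FlushReversed(run, result);
            result.Append(ch);
        }

        // Flush a trailing Arabic run, if any.
        FlushReversed(run, result);
        return result.ToString();
    }

    // Appends the buffered run back to front, then empties the buffer.
    private static void FlushReversed(StringBuilder run, StringBuilder result)
    {
        for (int i = run.Length - 1; i >= 0; i--)
            result.Append(run[i]);
        run.Clear();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Assumed sample: U+0643 U+062A U+0627 U+0628 in reversed order, then " PDF".
        string extracted = "\u0628\u0627\u062A\u0643 PDF";
        string restored = ArabicRunReverser.FixVisualOrder(extracted);

        Console.WriteLine(restored == "\u0643\u062A\u0627\u0628 PDF"); // prints "True"
    }
}
```

Note that a space is not in any of the Arabic ranges, so it ends a run: characters are reversed within each word, but the order of the words is left as extracted. If whole lines come back reversed, as in the copy/paste sample quoted in the comments, the word order may need separate handling.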
Joris Schellekens