Arabic PDF Text Extractor

Question

Is there a PDF text extractor API that can extract Arabic text from a PDF?

I am using the iText (itextpdf) API. It extracts English text fine, but it does not extract Arabic text.
This is my code for extracting text from a PDF:

private String extractPDF(String path) throws IOException {

        String parsedText = "";
        PdfReader reader = new PdfReader(path);
        int n = reader.getNumberOfPages();
        for (int page = 0; page < n; page++) {
            parsedText = parsedText + PdfTextExtractor.getTextFromPage(reader, page + 1).trim() + "\n"; //Extracting the content from the different pages
        }
        reader.close();

        return parsedText;
}

This is the input PDF: arabic.pdf

Update:

I am now able to extract Arabic text, but the order of the lines is not preserved. This is my code:

private String extractPDF(String name) throws IOException {

    PdfReader reader = new PdfReader(name);
    StringBuilder text = new StringBuilder();
    for (int i=1;i<=reader.getNumberOfPages();i++){
        String data = PdfTextExtractor.getTextFromPage(reader,i,new SimpleTextExtractionStrategy());
        text.append(Bidi.BidiText(data,1).getText());
    }
    return text.toString();
}

The PDF text is:

بسم الله الرحمن الرحيم

السلام عليكم ورحمة الله وبركاته

سبحان الله

The output is:

سبحان الله

السلام عليكم ورحمة الله وبركاته

بسم الله الرحمن الرحيم

This is the code of my BidiText method:

public static BidiResult BidiText(String str, int startLevel)
{
    boolean isLtr = true;
    int strLength = str.length();
    if (strLength == 0)
    {
        return new BidiResult(str, false);
    }

    // get types, fill arrays

    char[] chars = new char[strLength];
    String[] types = new String[strLength];
    String[] oldtypes = new String[strLength];
    int numBidi = 0;

    for (int i = 0; i < strLength; ++i)
    {
        chars[i] = str.charAt(i);

        char charCode = str.charAt(i);
        String charType = "L";
        if (charCode <= 0x00ff)
        {
            charType = BaseTypes[charCode];
        }
        else if (0x0590 <= charCode && charCode <= 0x05f4)
        {
            charType = "R";
        }
        else if (0x0600 <= charCode && charCode <= 0x06ff)
        {
            charType = ArabicTypes[charCode & 0xff];
        }
        else if (0x0700 <= charCode && charCode <= 0x08AC)
        {
            charType = "AL";
        }

        if (charType.equals("R") || charType.equals("AL") || charType.equals("AN"))
        {
            numBidi++;
        }

        oldtypes[i] = types[i] = charType;
    }

    if (numBidi == 0)
    {
        return new BidiResult(str, true);
    }

    if (startLevel == -1)
    {
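        // Assumed intent: the share of bidi characters in the string. The original
        // integer division (strLength / numBidi) could never be below 0.3.
        if (((double) numBidi / strLength) < 0.3)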
        {
            startLevel = 0;
        }
        else
        {
            isLtr = false;
            startLevel = 1;
        }
    }

    int[] levels = new int[strLength];

    for (int i = 0; i < strLength; ++i)
    {
        levels[i] = startLevel;
    }



    String e = IsOdd(startLevel) ? "R" : "L";
    String sor = e;
    String eor = sor;


    String lastType = sor;
    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("NSM"))
        {
            types[i] = lastType;
        }
        else
        {
            lastType = types[i];
        }
    }

    lastType = sor;
    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("EN"))
        {
            types[i] = (lastType.equals("AL")) ? "AN" : "EN";
        }
        else if (t.equals("R") || t.equals("L") || t.equals("AL"))
        {
            lastType = t;
        }
    }



    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("AL"))
        {
            types[i] = "R";
        }
    }



    for (int i = 1; i < strLength - 1; ++i)
    {
        if (types[i].equals("ES") && types[i - 1].equals("EN") && types[i + 1].equals("EN"))
        {
            types[i] = "EN";
        }
        // Compare string contents with equals(); == only compares references
        if (types[i].equals("CS") && (types[i - 1].equals("EN") || types[i - 1].equals("AN")) && types[i + 1].equals(types[i - 1]))
        {
            types[i] = types[i - 1];
        }
    }



    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("EN"))
        {
            // do before
            for (int j = i - 1; j >= 0; --j)
            {
                if (!types[j].equals("ET"))
                {
                    break;
                }
                types[j] = "EN";
            }
            // do after
            for (int j = i + 1; j < strLength; ++j) // walk forward; --j never reached the following characters
            {
                if (!types[j].equals("ET"))
                {
                    break;
                }
                types[j] = "EN";
            }
        }
    }



    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("WS") || t.equals("ES") || t.equals("ET") || t.equals("CS"))
        {
            types[i] = "ON";
        }
    }


    lastType = sor;
    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (t.equals("EN"))
        {
            types[i] = (lastType.equals("L")) ? "L" : "EN";
        }
        else if (t.equals("R") || t.equals("L"))
        {
            lastType = t;
        }
    }


    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("ON"))
        {

            int end = FindUnequal(types, i + 1, "ON");

            String before = sor;
            if (i > 0)
            {
                before = types[i - 1];
            }

            String after = eor;
            if (end + 1 < strLength)
            {
                after = types[end + 1];
            }
            if (!before.equals("L"))
            {
                before = "R";
            }
            if (!after.equals("L"))
            {
                after = "R";
            }
            if (before.equals(after)) // compare string contents, not references
            {
                SetValues(types, i, end, before);
            }
            i = end - 1; // reset to end (-1 so next iteration is ok)
        }
    }



    for (int i = 0; i < strLength; ++i)
    {
        if (types[i].equals("ON"))
        {
            types[i] = e;
        }
    }



    for (int i = 0; i < strLength; ++i)
    {

        String t = types[i];
        if (IsEven(levels[i]))
        {
            if (t.equals("R"))
            {
                levels[i] += 1;
            }
            else if (t.equals("AN") || t.equals("EN"))
            {
                levels[i] += 2;
            }
        }
        else
        { 
            if (t.equals("L") || t.equals("AN") || t.equals("EN"))
            {
                levels[i] += 1;
            }
        }
    }


    int highestLevel = -1;
    int lowestOddLevel = 99;
    int ii = levels.length;
    for (int i = 0; i < ii; ++i)
    {

        int level = levels[i];
        if (highestLevel < level)
        {
            highestLevel = level;
        }
        if (lowestOddLevel > level && IsOdd(level))
        {
            lowestOddLevel = level;
        }
    }



    for (int level = highestLevel; level >= lowestOddLevel; --level)
    {

        int start = -1;
        ii = levels.length;
        for (int i = 0; i < ii; ++i)
        {
            if (levels[i] < level)
            {
                if (start >= 0)
                {
                    chars = ReverseValues(chars, start, i);
                    start = -1;
                }
            }
            else if (start < 0)
            {
                start = i;
            }
        }
        if (start >= 0)
        {
            chars = ReverseValues(chars, start, levels.length);
        }
    }


    // Copy the reordered characters, dropping '<' and '>'
    StringBuilder result = new StringBuilder(chars.length);
    for (char ch : chars)
    {
        if (ch != '<' && ch != '>')
        {
            result.append(ch);
        }
    }

    return new BidiResult(result.toString(), isLtr);
}
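
For comparison with the hand-written method above: the JDK ships its own implementation of the Unicode Bidirectional Algorithm in `java.text.Bidi`. The following is only a minimal sketch under stated assumptions, not the fix the asker eventually used (which was never posted). It assumes the text is handled one line at a time, and the names `JdkBidiSketch` and `reorderLine` are made up for illustration. Reversing the characters inside right-to-left runs is a simplification that can misplace combining marks such as diacritics.

```java
import java.text.Bidi;

class JdkBidiSketch {

    // Reorders a single line with the JDK's bidi implementation.
    // Hypothetical helper: call it per line, never on a whole page.
    static String reorderLine(String line) {
        if (line.isEmpty()) {
            return line;
        }
        // Base direction comes from the first strong character, RTL if there is none
        Bidi bidi = new Bidi(line, Bidi.DIRECTION_DEFAULT_RIGHT_TO_LEFT);
        int runCount = bidi.getRunCount();
        byte[] levels = new byte[runCount];
        String[] runs = new String[runCount];
        for (int i = 0; i < runCount; i++) {
            levels[i] = (byte) bidi.getRunLevel(i);
            String run = line.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // Odd levels are right-to-left runs: reverse their characters (simplification)
            boolean rightToLeft = (levels[i] & 1) == 1;
            runs[i] = rightToLeft ? new StringBuilder(run).reverse().toString() : run;
        }
        // Put the runs themselves into reordered sequence according to their levels
        Bidi.reorderVisually(levels, 0, runs, 0, runCount);
        return String.join("", runs);
    }

    public static void main(String[] args) {
        System.out.println(reorderLine("\u0633\u0644\u0627\u0645"));
    }
}
```

Whether this reordering is needed at all depends on whether the extraction strategy returns the characters in visual or in logical order, so it is worth checking the raw output first.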

Asking for software recommendations is off-topic for StackOverflow. Try [Software Recommendations SE](https://softwarerecs.stackexchange.com/) but make sure your question is [ontopic](https://softwarerecs.stackexchange.com/help/on-topic) there before posting — Sardar Usama, Jun 06 '18 at 00:17
Arabic text extraction *is* possible with iText. This question is a duplicate of https://stackoverflow.com/q/40596320/766786 — Amedee Van Gasse, Jun 06 '18 at 06:30
This question is marked as off topic for the wrong reason (asking for recommendations). Please vote to reopen so it can be closed again with the right reason: duplicate. — Amedee Van Gasse, Jun 06 '18 at 06:32
@AmedeeVanGasse I followed the link and wrote the same class in Java, but it still doesn't recognize the Arabic text to extract. Do you know any way to extract Arabic text from a PDF? — OsamaFawzy, Jun 07 '18 at 08:28
I found a user who said he was able to extract Arabic text from a PDF, but he didn't post his code, and I can't comment to ask him about it because commenting requires 50 reputation. This is the link to that user's question: https://stackoverflow.com/questions/37340410/extraction-of-arabic-text-from-itext-giving-text-from-arabic-presentation-set-b — OsamaFawzy, Jun 07 '18 at 09:24
Concerning your update: Have you tried using the `LocationTextExtractionStrategy` instead of the `SimpleTextExtractionStrategy`? If that does not help, please indicate which `Bidi.BidiText` method you use. It probably inverses too much... — mkl, Jun 08 '18 at 10:05
@mkl `LocationTextExtractionStrategy` doesn't work. I have edited my question to include the code of my `BidiText` method. — OsamaFawzy, Jun 08 '18 at 21:49
Well, the code you provided is incomplete, e.g. what are `BidiResult`, `BaseTypes`, `ArabicTypes`, `IsOdd`, `FindUnequal`, and `SetValues`? That being said, have you checked whether the lines are in the correct order if you don't apply the `Bidi` code, i.e. if you use `text.append(data)` instead of `text.append(Bidi.BidiText(data,1).getText())`? — mkl, Jun 10 '18 at 19:49
@mkl I fixed the problem. But I have a question about another project: I get this error while building it and can't fix it. I searched for many hours but couldn't solve it. This is the error: Could not find support-vector-drawable.jar (com.android.support:support-vector-drawable:26.0.2). Searched in the following locations: https://jcenter.bintray.com/com/android/support/support-vector-drawable/26.0.2/support-vector-drawable-26.0.2.jar — OsamaFawzy, Jun 12 '18 at 11:16
Great that you fixed your problem. Concerning your other question: I have no idea, I practically do no Android development. You might want to make that a question here in its own right. — mkl, Jun 12 '18 at 11:34

score 0 · Accepted Answer · answered Jun 07 '18 at 14:27

Your example PDF does not contain any text at all; it merely contains an embedded bitmap image of text.

When talking about "text extraction from PDFs" (and "text extractor APIs" and PdfTextExtractor classes etc.), one usually means finding text drawing instructions in the PDF (for which a PDF viewer uses a font program either embedded in the PDF or available on the system at hand to display the text) and determining their text content from their string arguments and font encoding definitions.

As in your case there are no such text drawing instructions, merely a bitmap drawing instruction and the bitmap itself, text extraction from your document will return an empty string.

To retrieve the text displayed in your document, you have to look for OCR (optical character recognition) solutions. PDF libraries (like iText) can help you to extract the embedded bitmap image to forward to the OCR solution if the OCR solution does not directly support PDF but only bitmap formats.

If you also have PDF documents which display Arabic text using text drawing instructions with sufficient encoding information instead of bitmaps, you may need to post-process the text extraction output of iText with a method like `Convert`, as proposed in the answer that Amedee pointed out in a comment on your question. (Yes, it is written in C#, but it is pretty easy to port to Java.)
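
One part of such post-processing is often mapping Arabic presentation forms (the U+FB50 to U+FDFF and U+FE70 to U+FEFF blocks) back to the base letters. The following is a minimal sketch of that single step using the JDK's `java.text.Normalizer`. It is not a port of the `Convert` method from the linked answer, and the names `PresentationFormSketch` and `toBaseLetters` are hypothetical. It assumes the extracted string actually contains presentation-form code points.

```java
import java.text.Normalizer;

class PresentationFormSketch {

    // Maps compatibility characters, including Arabic presentation forms,
    // to their base characters via Unicode normalization form NFKC.
    static String toBaseLetters(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // SEEN initial form, LAM-ALEF ligature final form, MEEM isolated form
        String fromPdf = "\uFEB3\uFEFC\uFEE1";
        System.out.println(toBaseLetters(fromPdf));
    }
}
```

Note that NFKC also folds other compatibility characters (for example Latin ligatures and full-width digits), which may or may not be wanted for a given document.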

Thank you very much, sir. It works for me now, but it displays the Arabic text in reverse order. — OsamaFawzy, Jun 08 '18 at 00:25
The PDF text is: "السلام عليكم ورحمة الله وبركاته" The output is: "وبركاته الله ورحمة عليكم السلام" — OsamaFawzy, Jun 08 '18 at 01:11
That problem is solved, but now I face another one: if the file has more than one line, the output starts with the last line and continues up to the first, like a stack (first in, last out). — OsamaFawzy, Jun 08 '18 at 01:40
Which document and which text retrieval technique exactly are you using now as you have these problems? — mkl, Jun 08 '18 at 04:10
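
A likely cause of the reversed line order, though the thread never confirms it, is that a character-reversing step is applied to the whole page string, line breaks included, so the lines come out back to front as well. The following minimal sketch applies a per-line transformation and keeps the lines in their original order. `PerLineSketch` and `applyPerLine` are hypothetical names, and the plain character reversal in `main` only stands in for whatever reordering is actually used.

```java
import java.util.function.UnaryOperator;

class PerLineSketch {

    // Applies lineFix to every line separately so that the order of the
    // lines themselves is never touched.
    static String applyPerLine(String pageText, UnaryOperator<String> lineFix) {
        String[] lines = pageText.split("\\R", -1); // \R matches any line break
        StringBuilder out = new StringBuilder(pageText.length());
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(lineFix.apply(lines[i]));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in transformation: reverse the characters of one line
        UnaryOperator<String> reverseChars = s -> new StringBuilder(s).reverse().toString();
        String page = "cba\nfed";
        System.out.println(applyPerLine(page, reverseChars));
        // Reversing the whole page at once would also swap the two lines
        System.out.println(reverseChars.apply(page));
    }
}
```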
