
The code below correctly extracts the text from a PDF via iTextSharp in many instances.

                using (var pdfReader = new PdfReader(filename))
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    var currentText = PdfTextExtractor.GetTextFromPage(
                        pdfReader,
                        1,
                        strategy);

                    currentText =
                        Encoding.UTF8.GetString(Encoding.Convert(
                            Encoding.Default,
                            Encoding.UTF8,
                            Encoding.Default.GetBytes(currentText)));

                    Console.WriteLine(currentText);
                }

However, in the case of this PDF I get the following instead of text: "\u0001\u0002\u0003\u0004\u0005\u0006\a\b\t\a\u0001\u0002\u0003\u0004\u0005\u0006\u0003"

I have tried different encodings and even PDFBox but still failed to decode the PDF correctly. Any ideas on how to solve the issue?

Joe
  • Although this probably won't fix your problem, completely remove the `currentText = Encoding...` line [because it doesn't do what you think it does](http://stackoverflow.com/a/10191879/231316) – Chris Haas Nov 03 '15 at 14:06
  • I have read your prior work on it, thank you @ChrisHaas. I decided to include it in this post since in all the previous iText questions it was normally offered as the first solution. Just my luck that you actually looked at this question! – Joe Nov 03 '15 at 14:25
  • I don't answer every question (I do have a day job ;) ) but I at least read every one! – Chris Haas Nov 03 '15 at 21:41

2 Answers


Extracting the text nonetheless

@Bruno's answer is the answer one should give here: the PDF clearly does not provide the information required to allow proper text extraction according to section 9.10, "Extraction of Text Content", of the PDF specification ISO 32000-1...

But there actually is a slightly evil way to extract the text from the PDF at hand nonetheless!

Wrapping one's text extraction strategy in an instance of the following class, the garbled text is replaced by the correct text:

using System;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public class RemappingExtractionFilter : ITextExtractionStrategy
{
    ITextExtractionStrategy strategy;
    System.Reflection.FieldInfo stringField;

    public RemappingExtractionFilter(ITextExtractionStrategy strategy)
    {
        this.strategy = strategy;
        this.stringField = typeof(TextRenderInfo).GetField("text", System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance);
    }

    public void RenderText(TextRenderInfo renderInfo)
    {
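        // Look up the Differences array in the font's encoding dictionary.
        DocumentFont font = renderInfo.GetFont();
        PdfDictionary dict = font.FontDictionary;
        PdfDictionary encoding = dict.GetAsDict(PdfName.ENCODING);
        PdfArray diffs = encoding.GetAsArray(PdfName.DIFFERENCES);

        StringBuilder builder = new StringBuilder();
        foreach (byte b in renderInfo.PdfString.GetBytes())
        {
            // The character code is used directly as the array index,
            // which is specific to PDFs like this one.
            PdfName name = diffs.GetAsName(b);
            // Names look like /Gxx; drop "/G" and parse xx as hexadecimal.
            String s = name.ToString().Substring(2);
            int i = Convert.ToInt32(s, 16);
            builder.Append((char)i);
        }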

        stringField.SetValue(renderInfo, builder.ToString());
        strategy.RenderText(renderInfo);
    }

    public void BeginTextBlock()
    {
        strategy.BeginTextBlock();
    }

    public void EndTextBlock()
    {
        strategy.EndTextBlock();
    }

    public void RenderImage(ImageRenderInfo renderInfo)
    {
        strategy.RenderImage(renderInfo);
    }

    public String GetResultantText()
    {
        return strategy.GetResultantText();
    }
}

It can be used like this:

ITextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
string text = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

Beware, I had to use System.Reflection to access private members. Some environments may forbid this.

The same in Java

I initially coded this in Java for iText because that's my primary development environment. Thus, here is the initial Java version:

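import java.lang.reflect.Field;

import com.itextpdf.text.pdf.DocumentFont;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfDictionary;
import com.itextpdf.text.pdf.PdfName;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class RemappingExtractionFilter implements TextExtractionStrategy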
{
    public RemappingExtractionFilter(TextExtractionStrategy strategy) throws NoSuchFieldException, SecurityException
    {
        this.strategy = strategy;
        this.stringField = TextRenderInfo.class.getDeclaredField("text");
        this.stringField.setAccessible(true);
    }

    @Override
    public void renderText(TextRenderInfo renderInfo)
    {
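        // Look up the Differences array in the font's encoding dictionary.
        DocumentFont font = renderInfo.getFont();
        PdfDictionary dict = font.getFontDictionary();
        PdfDictionary encoding = dict.getAsDict(PdfName.ENCODING);
        PdfArray diffs = encoding.getAsArray(PdfName.DIFFERENCES);

        StringBuilder builder = new StringBuilder();
        for (byte b : renderInfo.getPdfString().getBytes())
        {
            // Java bytes are signed, so mask to get the unsigned character code.
            // Using the code directly as the array index is specific to PDFs like this one.
            PdfName name = diffs.getAsName(b & 0xff);
            // Names look like /Gxx; drop "/G" and parse xx as hexadecimal.
            String s = name.toString().substring(2);
            int i = Integer.parseUnsignedInt(s, 16);
            builder.append((char)i);
        }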

        try
        {
            stringField.set(renderInfo, builder.toString());
        }
        catch (IllegalArgumentException | IllegalAccessException e)
        {
            e.printStackTrace();
        }
        strategy.renderText(renderInfo);
    }

    @Override
    public void beginTextBlock()
    {
        strategy.beginTextBlock();
    }

    @Override
    public void endTextBlock()
    {
        strategy.endTextBlock();
    }

    @Override
    public void renderImage(ImageRenderInfo renderInfo)
    {
        strategy.renderImage(renderInfo);
    }

    @Override
    public String getResultantText()
    {
        return strategy.getResultantText();
    }

    final TextExtractionStrategy strategy;
    final Field stringField;
}

(RemappingExtractionFilter.java)

It can be used like this:

String extractRemapped(PdfReader reader, int pageNo) throws IOException, NoSuchFieldException, SecurityException
{
    TextExtractionStrategy strategy = new RemappingExtractionFilter(new LocationTextExtractionStrategy());
    return PdfTextExtractor.getTextFromPage(reader, pageNo, strategy);
}

(from RemappedExtraction.java)

Why does this work?

First of all, this is not a solution to all extraction problems; it is merely a way to extract text from PDFs like the one the OP has presented.

This method works because the names the PDF uses in its fonts' encoding differences arrays can be interpreted even though they are not standard. These names are built as /Gxx where xx is the hexadecimal representation of the ASCII code of the character this name represents.
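The name decoding described above can be tried without iText. The following is only a sketch in plain Java: the class `GlyphNameRemapper`, its methods, and the sample Differences list are made up for illustration and are not taken from the OP's PDF. Unlike the filter above, it follows the general layout of a Differences array (a number sets the next character code, and each following name gets consecutive codes) instead of using the byte value as the array index.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GlyphNameRemapper {

    /** Decodes a name such as "/G41" to the character with hex code 41 ('A'). */
    static char decodeGlyphName(String name) {
        if (!name.startsWith("/G")) {
            throw new IllegalArgumentException("Not a /Gxx glyph name: " + name);
        }
        return (char) Integer.parseInt(name.substring(2), 16);
    }

    /**
     * Builds a character-code to character map from a Differences-style list:
     * an Integer sets the next character code, and each following name is
     * assigned consecutive codes starting there.
     */
    static Map<Integer, Character> buildMap(List<Object> differences) {
        Map<Integer, Character> map = new HashMap<>();
        int code = 0;
        for (Object entry : differences) {
            if (entry instanceof Integer) {
                code = (Integer) entry;
            } else {
                map.put(code++, decodeGlyphName((String) entry));
            }
        }
        return map;
    }

    /** Remaps raw string bytes; codes without a mapping become U+FFFD. */
    static String remap(byte[] raw, Map<Integer, Character> map) {
        StringBuilder builder = new StringBuilder();
        for (byte b : raw) {
            int code = b & 0xFF; // bytes are signed in Java
            builder.append(map.getOrDefault(code, '\uFFFD'));
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Hypothetical Differences array, not the one from the OP's PDF.
        List<Object> differences = Arrays.<Object>asList(1, "/G41", "/G42", "/G53", "/G54");
        Map<Integer, Character> map = buildMap(differences);
        System.out.println(remap(new byte[] {1, 2, 3, 4}, map)); // prints "ABST"
    }
}
```

With this sample list, the codes 1 to 4 map to A, B, S and T, which is the same kind of translation the filter applies to each text chunk before handing it to the wrapped strategy.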

mkl
  • I already translated the code to C# and was just testing it! Thank you for the answer, I can't believe how simple it was once you highlighted the issue. – Joe Nov 04 '15 at 10:48

A good test to find out whether or not a PDF allows text to be extracted correctly is to open it in Adobe Reader and copy and paste the text.

For instance: I copied the word ABSTRACT and I pasted it in Notepad++:

[image: screenshot of the pasted text in Notepad++]

Do you see the word ABSTRACT in Notepad++? No, you see %&SOH'"%GS. The A is represented as %, the B is represented as &, and so on.

This is a clear indication that the content of the PDF isn't accessible: there is no mapping between the encoding that was used (% = A, & = B,...) and the actual characters that humans can understand.
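A rough programmatic counterpart to the copy-and-paste test is to check whether the extracted string consists mostly of control characters, as in the output quoted in the question. This is only a heuristic sketch in plain Java; the class `ExtractionSanityCheck`, its methods, and the 0.3 threshold are assumptions chosen for illustration, not part of iText or PDFBox.

```java
final class ExtractionSanityCheck {

    /** Fraction of characters that are control characters other than tab, CR and LF. */
    static double controlCharRatio(String text) {
        if (text.isEmpty()) {
            return 0.0;
        }
        int suspicious = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && c != '\n' && c != '\r' && c != '\t') {
                suspicious++;
            }
        }
        return (double) suspicious / text.length();
    }

    /** Flags text as garbled when more than 30% of it is control characters (arbitrary threshold). */
    static boolean looksGarbled(String text) {
        return controlCharRatio(text) > 0.3;
    }

    public static void main(String[] args) {
        // The string reported in the question, written with Java escapes.
        String extracted = "\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\u0007"
                + "\u0001\u0002\u0003\u0004\u0005\u0006\u0003";
        System.out.println(looksGarbled(extracted));  // prints "true"
        System.out.println(looksGarbled("ABSTRACT")); // prints "false"
    }
}
```

Such a check only catches encodings that land in the control character range; a PDF whose codes map to other printable characters would pass it and still be unreadable.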

In short: the PDF doesn't allow you to extract text, not with iText, not with iTextSharp, not with PDFBox. You'll have to find an OCR tool instead and OCR the complete document.

For more info, you may want to watch the following videos:

Bruno Lowagie
  • I will follow your links and watch the videos a bit later but when you view the PDF properties it states that Content copying is allowed – Joe Nov 03 '15 at 14:22
  • The fact that "Content copying" is allowed is irrelevant. (1.) That permission can be set using password encryption which is only psychological; that encryption can easily be removed. (2.) The content itself prevents you from copying it. If you don't believe me, I can elaborate. – Bruno Lowagie Nov 03 '15 at 14:24
  • Thank you for the information, clearly you know your stuff! – Joe Nov 04 '15 at 10:46