Text extraction returns empty or unknown characters for text in a Type 3 font using PDFBox, iText (difficult topic!)

Question

I have a PDF file in Arabic whose text uses a Type 3 font. When I extract the text using PDFBox, some characters come back empty and their font's base font is null. What is the problem?

Code:

  protected void processTextPosition(TextPosition text) {
    String character = text.getCharacter();      // is empty
    String font = text.getFont().getBaseFont();  // equals null
  }

stream produced with iText: ( dJ� v{d W�cG�)Tj

I am asking about these question marks: why do I get the characters in this format?

In my stream these question marks appear as "SOH-STX-ETX-EOT", not as a single character. The characters inside the PDF are shown as 'd' and 'J'!

*SOH-STX-ETX-EOT* - these are control code identifiers; the font in question seems to use byte values in the ASCII control char range but does not map them to a proper value in its **ToUnicode** map. Thus, your output still contains these values. — mkl, Feb 10 '14 at 09:00
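
To see which control codes actually end up in the extracted string, it can help to print them by name instead of letting the console show question marks. The following is only a minimal sketch in plain Java: the class name `ControlCodeNames` and the method `describe` are hypothetical helpers, not part of PDFBox or iText, and the input string stands in for whatever the extractor returned.

```java
// Hypothetical helper (not a PDFBox or iText API): makes ASCII control
// codes in an extracted string visible by replacing each with its name.
final class ControlCodeNames {
    // Standard ASCII names for the C0 control codes 0x00..0x1F.
    private static final String[] NAMES = {
        "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
        "BS", "HT", "LF", "VT", "FF", "CR", "SO", "SI",
        "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
        "CAN", "EM", "SUB", "ESC", "FS", "GS", "RS", "US"
    };

    static String describe(String extracted) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            if (c < 0x20) {
                out.append('<').append(NAMES[c]).append('>'); // control code
            } else if (c == 0x7F) {
                out.append("<DEL>");
            } else {
                out.append(c); // printable character, kept as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for text returned by the extractor.
        System.out.println(describe("\u0001\u0002\u0003\u0004dJ"));
        // prints <SOH><STX><ETX><EOT>dJ
    }
}
```

If the output is dominated by such names, the font's byte values are passing through unmapped, which matches the explanation in the comment above.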

score 3 · Answer 1 · answered Feb 10 '14 at 08:04

A Type 3 font is a user-defined font. For instance: a user can define that the character 'P' corresponds with the symbol for "The Artist Formerly Known As Prince" (TAFKAP) which is a glyph, but not a letter from any known alphabet.

A glyph in a Type 3 font is a series of lines and shapes, and there's no way for a program such as iText or PDFBox to determine which character was meant. It is only normal that you get a question mark. For instance: which character would you use for the TAFKAP symbol?

One of the following reasons applies for a PDF that contains Type 3 fonts:

1. The font was used to introduce symbols that don't exist in any font.
2. The font was used to obfuscate the content of the PDF so that its content can't be extracted.
3. The PDF wasn't created in an elegant way.

If the Type 3 font was used for normal characters, you'll need to use OCR to convert the content to normal text.

Bruno Lowagie

The OP in a [recent question](http://stackoverflow.com/questions/21577850/getbasefont-equal-null-in-pdfbox) presented a sample PDF which also included a Type3 font for Arabic glyphs. That PDF *did* contain **Encoding** and **ToUnicode** font dictionary entries, but both indicated an encoding mostly for Latin characters and math symbols, and certainly not for Arabic characters. Thus, I assume the reason is either 2 (obfuscation) or 3 (disregard of text-extractability even though the framework used seems to support it). – mkl Feb 10 '14 at 08:53
OK, I didn't see that other question. I'm not following questions about PDFBox (for obvious reasons ;-) ) – Bruno Lowagie Feb 10 '14 at 08:55
Sorry, what are these reasons? And I want to know the solution for this problem (is there no solution?) – Ayman Younis Feb 10 '14 at 15:35
Use OCR and hope the quality is sufficient. If not, there is no solution. – Bruno Lowagie Feb 10 '14 at 15:49
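
One way to decide when to fall back to OCR, given that the document is known to be Arabic, is to check whether the extracted text contains Arabic letters at all. This is only a rough sketch in plain Java: `ExtractionCheck`, `arabicLetterRatio`, `needsOcr` and the 0.5 threshold are hypothetical names and values chosen for illustration, not anything provided by PDFBox or iText.

```java
// Hypothetical heuristic (not a PDFBox or iText API): if a document is
// expected to be Arabic but the extracted letters are mostly non-Arabic,
// the font mapping is probably unusable and OCR is worth trying.
final class ExtractionCheck {

    /** Fraction of the letters in the text that belong to the Arabic script. */
    static double arabicLetterRatio(String text) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, punctuation, control codes
            }
            letters++;
            if (Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    /** True when too few letters are Arabic; minRatio is an assumed threshold. */
    static boolean needsOcr(String text, double minRatio) {
        return arabicLetterRatio(text) < minRatio;
    }

    public static void main(String[] args) {
        System.out.println(needsOcr("dJ v{d WcG", 0.5));                     // Latin letters only
        System.out.println(needsOcr("\u0645\u0631\u062D\u0628\u0627", 0.5)); // Arabic letters only
    }
}
```

A check like this only flags suspicious output; whether OCR then produces usable text still depends on the rendering quality, as noted above.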
