CGPDFScanner, Identity-H and decompression

Question

My instance of CGPDFScanner is scanning a test PDF file.

At one point, the current font dictionary has an Encoding value of Identity-H and a FontDescriptor dictionary containing the key FontFile2. The value of that key is a stream whose dictionary has a Filter key with the value FlateDecode.

I'm unsure how to interpret and use this, for example to convert the text in the next Tj operation to Unicode. Do I just zlib-decompress the bytes of the next Tj string? (There is no ToUnicode key here.)

I'd thought all the decompression was carried out by the instance of CGPDFScanner.

score 0 · Answer 1 · answered May 18 '11 at 10:29


If the font uses Identity-H encoding and does not have a ToUnicode entry, the text cannot be extracted. The operand of the Tj operator is a sequence of glyph indexes, and this sequence cannot be converted to text in the absence of the ToUnicode entry.
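
As a rough illustration of what "a sequence of glyph indexes" looks like in practice: with Identity-H, each code in the Tj string is two bytes, high byte first. The sketch below only splits the raw string bytes into those 16-bit codes; it does not (and cannot) map them to Unicode. The helper name identity_h_codes is hypothetical. In a CGPDFScanner operator callback the bytes would presumably come from CGPDFScannerPopString followed by CGPDFStringGetBytePtr and CGPDFStringGetLength.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: split the raw bytes of a Tj string operand into
 * 2-byte big-endian codes, as the Identity-H encoding implies.
 * Returns the number of codes written to `out`, or 0 if the byte length
 * is odd or `out` is too small to hold all the codes. */
size_t identity_h_codes(const unsigned char *bytes, size_t len,
                        uint16_t *out, size_t out_cap)
{
    if (len % 2 != 0 || len / 2 > out_cap) {
        return 0;
    }
    for (size_t i = 0; i < len / 2; i++) {
        /* High byte first, then low byte. */
        out[i] = (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
    }
    return len / 2;
}
```

For example, the four string bytes 00 24 01 0A yield the two codes 0x0024 and 0x010A. Without a ToUnicode entry, nothing in the font dictionary says which characters those codes stand for.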

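The FontFile2 entry stores the actual font file; it has no role when extracting text from the PDF file. Its FlateDecode filter describes how the embedded font program itself is compressed and does not apply to the string operands of Tj.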

answered May 18 '11 at 10:29

iPDFdev


I do not think they manage it in any way. Did you try to copy text from such a file? Adobe Acrobat copies and pastes blank characters in this situation. – iPDFdev May 18 '11 at 11:47
If you can upload the file somewhere, I can take a look at it. Just let me know what text I should look at. – iPDFdev May 18 '11 at 16:45
