Cross reference stream decoding

Question

I wrote a PDF parser with my own Deflate decoder. It works fine for all the stream contents I have encountered so far, but it fails to decode the binary contents of an xref stream. Here, in hexadecimal, are the bytes found after "stream" and before "endstream":

68 DE 62 62 62 60 D8 CB F4 9F 79 F5 1F 20 83 A9 19 48 F0 AB 01 09 86 E9 00 01 06 00 46 3B 04 C4

After dropping the first two bytes (the zlib header 0x68 0xDE), my Deflate decoder reads 26 bytes and stops, having produced 34 bytes of output:

02 02 00 00 BD 02 FF 03 AB FC 02 00 00 02 83 02 00 00 02 83 02 FF 26 02 83 02 0F 26 02 83 02 0F 00 97

Here is the file: https://drive.google.com/open?id=1BHB0AAdAVA6EQuE-aPHrCXvdqU8uUWFP

The /Filter entry in the stream dictionary is set to /FlateDecode and, as far as I can see, this file is not encrypted. There is also a /DecodeParms entry set to a dictionary containing /Columns 4 /Predictor 12, but, if I'm right, it must be used after inflating the stream binary data.

The Deflate decoder produces 34 bytes on output but, even after applying a PNG Up filter, they don't make sense to me.

Thanks.

Please share the full PDF file you are talking about. – mkl Oct 06 '19 at 17:05
Did my answer clear up the issue for you? – mkl Oct 17 '19 at 14:23

score 1 · Answer 1 · answered Oct 11 '19 at 13:46

FLATE decoding that stream results in

02 02 00 00 BD 02 FF 03 AB FC 02 00 00 02 83 02 00 00 0F 26 02 00 00 00 97

That is 25 bytes. Thus, your claim that "The Deflate decoder produces 34 bytes" indicates that something is going wrong in your Deflate decoder.

Have you considered that PDF FLATE encoded streams do not contain the naked FLATE compressed data but instead are in the ZLIB Compressed Data Format wrapping FLATE compressed data? If your decoder expects the naked data, you first have to drop the ZLIB header. For details and references see this answer.
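As a cross-check, here is a minimal sketch that inflates the bytes from the question with Python's standard `zlib` module, once with the ZLIB wrapper left in place and once as raw Deflate after skipping the two header bytes. The variable names are only illustrative; the input is copied from the question.

```python
import zlib

# Bytes between "stream" and "endstream", copied from the question.
raw = bytes.fromhex(
    "68 DE 62 62 62 60 D8 CB F4 9F 79 F5 1F 20 83 A9"
    " 19 48 F0 AB 01 09 86 E9 00 01 06 00 46 3B 04 C4"
)

# Option 1: let zlib consume the 2-byte header and the 4-byte Adler-32 trailer.
inflated = zlib.decompress(raw)

# Option 2: skip the header by hand and inflate as raw Deflate (wbits = -15).
raw_inflater = zlib.decompressobj(-15)
inflated_raw = raw_inflater.decompress(raw[2:])
trailer = raw_inflater.unused_data  # whatever follows the final Deflate block

print(len(inflated), " ".join(f"{b:02X}" for b in inflated))
print("Adler-32 matches:", zlib.adler32(inflated) == int.from_bytes(trailer, "big"))
```

Both paths yield the same 25 bytes, and the four bytes left over in the raw case are the Adler-32 checksum of that output. A home-grown decoder can use the same checksum to detect a wrong result.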

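To finish the decoding we have to undo the PNG Up predictor. With /Columns 4, each row is one filter-type byte (02 = Up) followed by four data bytes, and each data byte is added, modulo 256, to the byte above it in the previous decoded row (all zeros for the first row):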

02 02 00 00 BD         02 00 00 BD
02 FF 03 AB FC         01 03 AB B9
02 00 00 02 83   -->   01 03 AD 3C
02 00 00 0F 26         01 03 BC 62
02 00 00 00 97         01 03 BC F9
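The table above can be reproduced with a few lines of code. This is a minimal sketch that only handles PNG filter types 0 (None) and 2 (Up), since those are all this stream needs; the function name `undo_png_up` is hypothetical, not part of any library.

```python
def undo_png_up(data: bytes, columns: int) -> list:
    """Undo PNG prediction for rows using filter type 0 (None) or 2 (Up)."""
    row_len = columns + 1  # one filter-type byte precedes each row
    if len(data) % row_len:
        raise ValueError("data length is not a multiple of the row length")
    previous = bytes(columns)  # the row above the first row counts as all zeros
    rows = []
    for start in range(0, len(data), row_len):
        filter_type = data[start]
        current = data[start + 1 : start + row_len]
        if filter_type == 2:  # Up: add the byte directly above, modulo 256
            current = bytes((c + p) & 0xFF for c, p in zip(current, previous))
        elif filter_type != 0:
            raise ValueError(f"unsupported PNG filter type {filter_type}")
        rows.append(current)
        previous = current
    return rows


# The 25 inflated bytes from above, with /Columns 4.
inflated = bytes.fromhex(
    "02 02 00 00 BD 02 FF 03 AB FC 02 00 00 02 83"
    " 02 00 00 0F 26 02 00 00 00 97"
)
decoded_rows = undo_png_up(inflated, 4)
for row in decoded_rows:
    print(" ".join(f"{b:02X}" for b in row))
```

A full implementation would also cover the Sub, Average, and Paeth filter types, which this sketch deliberately rejects.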

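/Index [185 2 188 3] says the entries describe objects 185 and 186, then 188 through 190. /W [1 3 0] says each entry has a 1-byte type field, a 3-byte second field, and no third field (so it takes its default). Therefore: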

Object 185 is in object stream 189
Object 186 starts at offset 0x03ABB9
Object 188 starts at offset 0x03AD3C
Object 189 starts at offset 0x03BC62
Object 190 starts at offset 0x03BCF9
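The mapping from decoded rows to objects can be sketched as follows. The function name `parse_xref_rows` and the tuple layout are assumptions made for illustration; the defaulting of zero-width fields (type 1 for the first field, 0 for the others) follows my reading of the cross-reference stream rules and should be checked against the specification.

```python
def parse_xref_rows(rows, index, widths):
    """Map object numbers to (type, field2, field3) tuples."""
    entries = {}
    row_iter = iter(rows)
    # /Index holds pairs: first object number, number of entries.
    for first, count in zip(index[0::2], index[1::2]):
        for obj_num in range(first, first + count):
            row = next(row_iter)
            fields, pos = [], 0
            for i, width in enumerate(widths):
                if width == 0:
                    # Assumed defaults: type 1 for field 1, 0 otherwise.
                    fields.append(1 if i == 0 else 0)
                else:
                    fields.append(int.from_bytes(row[pos : pos + width], "big"))
                pos += width
            entries[obj_num] = tuple(fields)
    return entries


# Rows after undoing the predictor, as listed above.
rows = [bytes.fromhex(h) for h in
        ("020000BD", "0103ABB9", "0103AD3C", "0103BC62", "0103BCF9")]
entries = parse_xref_rows(rows, [185, 2, 188, 3], [1, 3, 0])

for obj_num, (kind, second, third) in entries.items():
    if kind == 2:
        print(f"Object {obj_num} is in object stream {second}, index {third}")
    elif kind == 1:
        print(f"Object {obj_num} starts at offset 0x{second:06X}")
```

For a type 1 entry the second field is the byte offset and the third is the generation number; for a type 2 entry the second field is the object stream number and the third is the index within that stream.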
