I'm trying to read a PDF file that is linearized and uses cross-reference streams. I believe that I mostly understand what's happening except for the last two entries in the table. Those two, for objects 5 and 6, appear to be in use but show file offsets that vastly exceed the file size. Also, the PDF file I have doesn't even have objects number 5 or 6 in it.
Here is the cross-reference stream:
4 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<ED772C59D33BA74FA1DEE567740067A0><ED772C59D33BA74FA1DEE567740067A0>]/Info 6 0 R/Length 39/Root 8 0 R/Size 7/Type/XRef/W[1 3 0]>>stream
hfibb&F…ˆl&fit ¡ÿ"∏ôügÕ≤=‘
endstream
And here are the raw data after FlateDecode, arranged in rows. FlateDecode reports that 35 bytes of data were inflated.
02 00 00 00 00
02 01 19 87 6b
02 00 00 0d 67
02 00 00 01 8c
02 00 00 01 0b
02 01 e7 6a 99
02 00 00 00 01
I also applied a PNG Predictor function (up) which yielded 7 rows of 4 bytes each:
00 00 00 00
01 19 87 6b
01 19 94 d2
00 00 0e f3
00 00 02 97
01 e7 6b a4
01 e7 6a 9a
Row 0 is all zero, check. The offsets for object 1 and 2 do in fact address object 1 and 2 in the PDF file. So far, so good. Object 3 is marked unused, and for sure there is no object 3 in the PDF file.
But then, I'm a little confused that object 4, this cross-reference stream, is marked as unused. Still, since it is object 4 that I am parsing, I've clearly had no difficulty finding it.
But where I am completely confused are the rows for object 5 and 6. The "01" in the first column tells me that they are in use. But their offsets exceed the size of the entire file, and in any case, there are no object 5 nor 6 in the file. The Size entry in the dictionary clearly has a value of 7, telling me the table should contain data for objects 0 thru 6. After filtering, I have 28 bytes of data, which makes sense for seven rows of four bytes each.
Why are entries for 5 and 6 there at all? And, given that they are there, why are they marked as "in use" with apparently nonsense offsets?
The file seems valid. Both Adobe Illustrator and Acrobat Reader open it without complaint. I haven't found anything in the PDF spec about special treatment for the last two rows of an Xref stream. What am I missing?