Trying to understand data in cross-reference (XRef) stream in PDF

Question

I'm trying to read a PDF file that is linearized and uses cross-reference streams. I believe that I mostly understand what's happening except for the last two entries in the table. Those two, for objects 5 and 6, appear to be in use but show file offsets that vastly exceed the file size. Also, the PDF file I have doesn't even contain objects numbered 5 or 6.

Here is the cross-reference stream:

4 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<ED772C59D33BA74FA1DEE567740067A0><ED772C59D33BA74FA1DEE567740067A0>]/Info 6 0 R/Length 39/Root 8 0 R/Size 7/Type/XRef/W[1 3 0]>>stream

hﬁbb&F…ˆl&ﬁt ¡ÿ"∏ôügÕ≤=‘

endstream

And here are the raw data after FlateDecode, arranged in rows. FlateDecode reports that 35 bytes of data were inflated.

02 00 00 00 00
02 01 19 87 6b
02 00 00 0d 67
02 00 00 01 8c
02 00 00 01 0b
02 01 e7 6a 99
02 00 00 00 01

I also applied a PNG Predictor function (up) which yielded 7 rows of 4 bytes each:

00 00 00 00
01 19 87 6b
01 19 94 d2
00 00 0e f3
00 00 02 97
01 e7 6b a4
01 e7 6a 9a

Row 0 is all zero, check. The offsets for object 1 and 2 do in fact address object 1 and 2 in the PDF file. So far, so good. Object 3 is marked unused, and for sure there is no object 3 in the PDF file.

But then, I'm a little confused that object 4, this cross-reference stream itself, is marked as unused. Still, since object 4 is what I am parsing, I clearly had no difficulty finding it.

Where I am completely confused is the rows for objects 5 and 6. The "01" in the first column tells me that they are in use. But their offsets exceed the size of the entire file, and in any case, there is no object 5 or 6 in the file. The Size entry in the dictionary clearly has a value of 7, telling me the table should contain data for objects 0 through 6. After filtering, I have 28 bytes of data, which makes sense for seven rows of four bytes each.

Why are entries for 5 and 6 there at all? And, given that they are there, why are they marked as "in use" with apparently nonsense offsets?

The file seems valid. Both Adobe Illustrator and Acrobat Reader open it without complaint. I haven't found anything in the PDF spec about special treatment for the last two rows of an Xref stream. What am I missing?

You interpret the predictor to add the current input row and the previous input row to retrieve the current data row. Shouldn't you add the current input row and the previous data row? That would change results for object 3 onward. — mkl, Jun 28 '19 at 13:15
@mkl That's what I think I did. My post shows the before and after of applying the PNG predictor. In the "before" data, each row has a predictor of 0x02. In the "after" row, there is no predictor byte, since the prediction has already been done. — Logicrat, Jun 28 '19 at 21:00
@MihaiIancu I can't post a link to the PDF today; I'm working on getting permission to do that but I'm not the owner of the data. However, this link (http://pastebin.com/521k2yGD) leads to a page that has data just concerning the file structure and the cross-reference streams in it. I hope it's useful. Thanks. — Logicrat, Jun 28 '19 at 21:05

score 3 · Accepted Answer · answered Jun 29 '19 at 07:25

You interpret the predictor to add the current input row and the previous input row to retrieve the current data row. Shouldn't you add the current input row and the previous data row? That would change results for object 3 onward:

02 00 00 00 00    00 00 00 00
02 01 19 87 6b    01 19 87 6b
02 00 00 0d 67    01 19 94 d2
02 00 00 01 8c    01 19 95 5e
02 00 00 01 0b    01 19 96 69
02 01 e7 6a 99    02 00 00 02
02 00 00 00 01    02 00 00 03
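
The table above can be reproduced with a minimal Python sketch. It assumes every row carries the Up filter tag (0x02), as in the inflated data from the question; the function name undo_png_up is made up for illustration and is not part of any library.

```python
def undo_png_up(raw: bytes, columns: int) -> list[bytes]:
    """Undo PNG 'Up' prediction (filter type 2) row by row."""
    stride = columns + 1   # one filter-type byte precedes each row
    prev = bytes(columns)  # the row "above" the first row is all zeros
    rows = []
    for start in range(0, len(raw), stride):
        tag = raw[start]
        data = raw[start + 1:start + stride]
        if tag != 2:
            raise ValueError(f"unsupported PNG filter type {tag}")
        # Add the previous *decoded* row (not the previous input row),
        # byte by byte, modulo 256 with no carry between bytes.
        prev = bytes((d + p) & 0xFF for d, p in zip(data, prev))
        rows.append(prev)
    return rows


# The 35 inflated bytes from the question, one row per string.
inflated = bytes.fromhex(
    "0200000000" "020119876b" "0200000d67" "020000018c"
    "020000010b" "0201e76a99" "0200000001"
)

for row in undo_png_up(inflated, columns=4):
    print(row.hex(" "))
```

The only difference from the calculation in the question is that prev is overwritten with the decoded row on each pass, so the sums accumulate down the table.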

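Now objects 3 and 4 have proper offsets matching the data from your pastebin paste, and objects 5 and 6 are marked as type 2 entries, i.e. compressed objects stored inside object streams. That is why you find no "5 0 obj" or "6 0 obj" in the file, and why their second field is not a file offset.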
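
To make the interpretation of the decoded rows concrete, here is a hedged Python sketch that splits each row according to /W [1 3 0] and labels the entry type. The helper names parse_xref_rows and describe are hypothetical. Treating the absent third field as 0 is an assumption: the spec gives 0 as the default generation number for type 1 entries, and this sketch simply applies the same value to the index of type 2 entries.

```python
def parse_xref_rows(rows, widths=(1, 3, 0)):
    """Split decoded rows into (type, field2, field3) integer tuples."""
    entries = []
    for row in rows:
        fields, pos = [], 0
        for width in widths:
            # A zero-width field is absent from the data; an empty slice
            # converts to 0, which is used here as the stand-in default.
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        if widths[0] == 0:
            fields[0] = 1  # an absent type field defaults to type 1
        entries.append(tuple(fields))
    return entries


def describe(obj_num, entry):
    """Return a human-readable label for one cross-reference entry."""
    kind, second, third = entry
    if kind == 0:
        return f"object {obj_num}: free"
    if kind == 1:
        return f"object {obj_num}: in use at byte offset {second}"
    if kind == 2:
        return f"object {obj_num}: compressed, in object stream {second}, index {third}"
    return f"object {obj_num}: unknown entry type {kind}"


# Decoded rows copied from the right-hand column of the table above.
decoded = [bytes.fromhex(h) for h in (
    "00000000", "0119876b", "011994d2", "0119955e",
    "01199669", "02000002", "02000003",
)]

entries = parse_xref_rows(decoded)
for num, entry in enumerate(entries):
    print(describe(num, entry))
```

Without an /Index entry in the dictionary, the rows are numbered from object 0 upward, which is what enumerate assumes here.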
