
Going through the PDF spec, it says that the trailer precedes the startxref. To me, that says the xref can appear anywhere in the document, but the trailer still appears before the startxref. This makes sense until you have to parse it: because you have to parse in reverse, you can't take comments or strings into account. Let's get a little more wacky, then.

trailer<< %\
  /Size 4 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 4 %\
  /Root 2 0 R %\
  /Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
 15
%%EOF

Which is a perfectly valid trailer. The first one is the real trailer, but the second one is in a "string". In this case, reverse parsing is going to fail to catch the comments. Looking for the keyword trailer is going to fail if it's part of a comment or string. What is the best way of finding out where the trailer starts?
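To make the failure concrete, here is a minimal Python sketch. It assumes the example above is held in memory as bytes (with the second closing line written as `>>%)`, as in the later examples); the name `SAMPLE` is only for illustration. It shows that a plain backward search for the keyword lands on the copy inside the string, not on the real trailer.

```python
# The trailer from the example above, one line per list item.
# Each "\\" is a single backslash byte in the file.
SAMPLE = b"\n".join([
    b"trailer<< %\\",
    b"  /Size 4 %\\",
    b"  /Root 1 0 R %\\",
    b"  /Info 4 0 R %\\",
    b"  /Key (\\",
    b"trailer<< %\\",
    b"  /Size 4 %\\",
    b"  /Root 2 0 R %\\",
    b"  /Info 3 0 R %\\",
    b">>%)",
    b">>%)",
    b"% test test )",
    b"startxref",
    b" 15",
    b"%%EOF",
])

# Naive approach: search backwards from the end for the keyword.
naive = SAMPLE.rfind(b"trailer")

# The real trailer is the first occurrence; the second sits inside /Key (...).
real = SAMPLE.find(b"trailer")

print("backward search found offset", naive, "but the real trailer is at", real)
```

Everything from `naive` onward mentions only `/Root 2 0 R`, which is the root of the decoy, so a parser that trusts the backward hit reads the wrong dictionary.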

Update - This trailer seems to open in Acrobat Reader

%PDF-1.3
%âãÏÓ
xref
0 4
00000000 65535 f
00000110 00000 n
00000250 00000 n
00000315 00000 n
00000576 00000 n

1 0 obj <<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction [ 3 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
  /Type /Pages
  /Kids [ 3 0 R ]
  /Count 1
>>
endobj
3 0 obj <<
  /Type /Page
  /Parent 2 0 R
  /Resources << >>
  /MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj

trailer<< %\
  /Size 4 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 4 %\
  /Root 2 0 R %\
  /Info 3 0 R %\
>>%)
>>%)
% test test )
startxref
 15
%%EOF

As far as syntax goes, this conforms to spec. Somehow Acrobat seems to know whether it is in a comment or a string. Parsing left to right, the second trailer is inside a string with a % tacked on, followed by a comment after the trailer. Parsing right to left, you have no idea whether the first ) is part of a comment or the end of a string definition.

Another Example:

%PDF-1.3
%âãÏÓ
xref
0 8
0000000000 65535 f
0000000210 00000 n
0000000357 00000 n
0000000428 00000 n
0000000533 00000 n
0000000612 00000 n
0000000759 00000 n
0000000830 00000 n
0000000935 00000 n

1 0 obj <<
  /Type /Catalog
  /Pages 2 0 R
  /OpenAction [ 3 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
2 0 obj <<
  /Type /Pages
  /Kids [ 3 0 R ]
  /Count 1
>>
endobj
3 0 obj <<
  /Type /Page
  /Parent 2 0 R
  /Resources << >>
  /MediaBox [ 0 0 612 792 ]
>>
endobj
4 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj
5 0 obj <<
  /Type /Catalog
  /Pages 6 0 R
  /OpenAction [ 7 0 R /XYZ null null null ]
  /PageLabels << /Nums [0 << /S /D >> ] >>
>>
endobj
6 0 obj <<
  /Type /Pages
  /Kids [ 7 0 R ]
  /Count 1
>>
endobj
7 0 obj <<
  /Type /Page
  /Parent 6 0 R
  /Resources << >>
  /MediaBox [ 0 0 100 100 ]
>>
endobj
8 0 obj <<
  /Producer (Me)
  /CreationDate (D:20110626000000Z)
>>
endobj

trailer<< %\
  /Size 8 %\
  /Root 1 0 R %\
  /Info 4 0 R %\
  /Key (\
trailer<< %\
  /Size 8 %\
  /Root 5 0 R %\
  /Info 8 0 R %\
>>%)
>>%)
% test test )
startxref
 17
%%EOF

This example is displayed correctly in Adobe. For my last case, you claimed it would fail because the "root" node is invalid, but in this new sample the root is valid; it's just never actually used. So shouldn't it display a 100x100 window instead of the 8.5"x11"?

In regard to the Resources

  (Required; inheritable) A dictionary containing any resources required by the page 
(see Section 3.7.2, “Resource Dictionaries”). If the page requires no resources, the 
value of this entry should be an empty dictionary. Omitting the entry entirely
indicates that the resources are to be inherited from an ancestor node in the page 
tree.
Rahly
  • @Jeremy Walton: Look, the `trailer<<...>>` dictionary you constructed is probably not even a valid dictionary at all, therefore it's also not a valid *trailer* dictionary... According to the spec, the trailer dictionary must consist of *"a series of key-value pairs"*. The 'legal' keys in the trailer dictionary are limited to a quite narrow set, some of which are optional (see Table 15 in Section 7.5.5). – Kurt Pfeifle Jun 24 '11 at 16:49
  • Actually, it IS a valid dictionary entry, see the edit above. There is no such thing as "legal" keys. There are required keys and there are optional keys; all other keys are ignored by the processor, but are parsed. – Rahly Jun 26 '11 at 08:35
  • @Jeremy Walton: you are splitting hairs where hairsplitting doesn't lead to anything. Did you note the quotes around "legal"? Of course any sane parser should ignore "unknown", "illegal" keys. And I explicitly mentioned "required" and "optional" keys and even linked to the table that exhaustively enumerates all of them. Go on splitting hairs and be happy with it, I won't interfere any more. – Kurt Pfeifle Jun 26 '11 at 13:27
  • @Jeremy Walton: it is not a dictionary that makes any sense, because it does not contain a single key making any sense. It is not a valid trailer, because there is not a single one of the required keys in it. – Kurt Pfeifle Jun 26 '11 at 13:28
  • Look at the update though. The initial one, I'll admit, didn't have some keys, but it was an example. The updated one has a correct trailer. – Rahly Jun 26 '11 at 15:20
  • @Jeremy Walton: *"As far as syntax goes, this conforms to spec."* -- No. The xref table lines are too short. Also, the empty `/Resources` dictionary in the `/Page` object doesn't look kosher to me (but I didn't check with the spec). Correct this, and add some visible content to the file to see if Acrobat renders it. I've seen Acrobat rendering empty pages before, where there should have been visible marks, when the file was invalid, without Acrobat emitting any warnings. – Kurt Pfeifle Jun 26 '11 at 21:48
  • Yes, resources is correct. See new example. – Rahly Jun 29 '11 at 05:40
  • I suggest you have two different sets of producer and creation date in your devil's advocate PDF. It's also theoretically possible that Acrobat could decide to use 8.5"x11" when Things Go South... so Acrobat could just be choking on your file and you wouldn't know it. – Mark Storer Jun 29 '11 at 18:27
  • Well, I can't see them choking and someone else NOT, because Foxit reads the trailer in the string, which is the wrong one, and then displays the 100x100 instead of the 8.5x11. But Adobe displays the correct one. If it's choking, it's choking on the correct one. – Rahly Jun 30 '11 at 02:17

5 Answers


The startxref statement usually is at the end of the file, with the trailer preceding it.

Update: The introductory sentence above was not formulated clearly enough, as Jeremy Walton correctly observed (though later comments in my answer hinted at the exceptions). It should have read: "The startxref statement usually appears at the end of the file as a single instance, with the trailer preceding it (unless your file has undergone incremental updates, in which case you may have several instances of cross-reference sections with their associated trailers)."

If there are comments sprinkled into the PDF, they count the same as "real" PDF page description code when it comes to byte counting for the xref table byte-offset calculations. Therefore, it is not a problem to parse it correctly.

To quote straight "from the horse's mouth" (PDF specification ISO 32000-1, Section 7.5.5):

"The trailer of a PDF file enables a conforming reader to quickly find the cross-reference table and certain special objects. Conforming readers should read a PDF file from its end. The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets [...]"

The key expression to take into account here is "LAST cross-reference section".
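The "read from the end" step in that quote can be sketched in Python as follows. This is only an illustration: the function name `find_startxref` and the 1024-byte tail window are assumptions of mine, not something the spec prescribes.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last `startxref` keyword."""
    # Assumption: the startxref/%%EOF block lies within the last 1024 bytes.
    tail = data[max(0, len(data) - 1024):]
    idx = tail.rfind(b"startxref")
    if idx < 0:
        raise ValueError("no startxref keyword near the end of the file")
    # Keyword, then the offset, then the end-of-file marker.
    match = re.match(rb"startxref\s+(\d+)\s+%%EOF", tail[idx:])
    if match is None:
        raise ValueError("malformed startxref / %%EOF block")
    return int(match.group(1))


pdf = b"%PDF-1.3\nxref\n0 1\n0000000000 65535 f \ntrailer\n<< /Size 1 >>\nstartxref\n 15\n%%EOF\n"
xref_offset = find_startxref(pdf)
```

Note that this step only locates the cross-reference section; it says nothing yet about where the trailer keyword is.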

If you have updated trailers in mind, have a look at Section 7.5.6.

Yes, you have to parse in reverse. The first cross-reference section to read is the last one appearing in the file -- and it will have a preceding last trailer. The second one to read is the last-but-one appearing in the file -- with a preceding last-but-one trailer. And so on. If you have to read more than one trailer/xref section, each one you read has to contain a reference to the next one to read.

If you think of "comments" as something you can freely insert into a PDF without corrupting its structure, think again. Once you have inserted comments, you have to update at least the xref table (and maybe the /Length keys of objects).


Update 2: The trailer<<...>> dictionary Jeremy constructed is probably not even a valid dictionary at all, therefore it's also not a valid trailer dictionary...

Anyway, according to the spec, the trailer dictionary must consist of "a series of key-value pairs". The 'legal' keys in the trailer dictionary are limited to a quite narrow set, some of which are even optional (see Table 15 in Section 7.5.5).

Jeremy seems to have constructed his example in such a way as to (mis-)understand this snippet as a potentially valid trailer dictionary:

trailer<<%) >>
% test test )

Which of course isn't a dictionary at all, since we don't see any key-value pair here.

His full example isn't valid either, because the "key" called /Key isn't amongst the valid key names for the trailer (which are, according to Table 15: /Size, /Prev, /Root, /Encrypt, /Info, /ID, /XRefStm).

So Jeremy should do in his PDF parsing code the same that all sane and even most insane PDF processing libraries do: give up on obviously invalid constructs instead of searching for sense in them, and tell the user that "your damn PDF is corrupt because we cannot identify valid keys in the supposed trailer section of the file".

Kurt Pfeifle
  • From the spec, startxref is ALWAYS at the end of the file, but there could be more than one, due to updates. Finding the cross-reference section is not a problem; it's finding the "trailer dictionary", aka "trailer << >>". My issue is finding the trailer keyword reading backwards, because you have no idea if the keyword is part of a comment or part of a string when reading backwards, only forwards. If the trailer object was in the xref table, because it has a defined starting point, I don't think I'd have an issue. – Rahly Jun 24 '11 at 09:09
  • Also, comments in the trailer dictionary, would never affect the xref table because its after all known objects. – Rahly Jun 24 '11 at 09:09
  • Why would you mix the keyword with a comment? You are just asking for trouble if you start experimenting with this stuff. Keep it simple. Your PDF will need to be opened by hundreds of different applications, not just Adobe. Listen to these guys, they do this stuff for a living. – Rowan Jun 24 '11 at 11:27
  • @Rowan. I'm not trying to CREATE PDFs, I'm trying to READ them. According to the spec, this can happen, not that it should or would by most people. Which means a parser would have to account for it. Reading backwards you can't tell if you are in a string or comment (which can appear just about anywhere but the `startxref`), so searching for the keyword `trailer` doesn't work unless you are reading from the front of the file, forward. – Rahly Jun 24 '11 at 14:42
  • Actually, for your new update, here is the trailer "key" with a value of "trailer<<%" in a string, all keys don't have to be "valid" as custom keys can exist for specialized values for readers, custom key/value pairs are ignored by readers that don't support them – Rahly Jun 25 '11 at 23:04
  • @Jeremy Walton: you are trying to unnecessarily lecture me about the obvious things that I don't at all deny, ignore or "not-know", while you are rejecting hints about some other obvious things you seem to be in plain denial of... – Kurt Pfeifle Jun 26 '11 at 13:33
  • No, i'm trying to tell you, that you admit to parse in reverse, not look for "trailer". And then proceed to tell me, that you can tell if its a valid trailer, because it'll be followed by a valid dictionary. Which means, you are telling me find `trailer` first and then parse FORWARD to find a valid dictionary. Parsing in reverse would NOT tell me that the first trailer is not a real one, parsing forward would. But you can't parse forward unless you start at the beginning, because of strings/comments, unless you look for string and move forward. – Rahly Jun 26 '11 at 15:29
  • @Jeremy Walton: *Sigh!* No, I do not need to *"admit"* that one needs to parse a PDF file "from the end". Because I know it. It's in the spec. **Of course** (!!!), after you found a 'trailer' candidate, you need to read a few lines forward to verify if it is what you wanted to find. If not, try again + continue reading in reverse direction for the next 'trailer' candidate, or give up.... – Kurt Pfeifle Jun 26 '11 at 16:49
  • If you see the example above, the second trailer, which you will hit first, looks correct and parses correct. You'd have to validate the entire document, to know if the trailer is the real one, or you need to look back more – Rahly Jun 26 '11 at 18:49
  • @Jeremy: Yes, the 2nd trailer will be encountered 1st + will parse correctly. The parser then would have to look at the PDF "root" object. The correct-looking (but invalid) trailer will send the parser to look at `2 0 obj`. But when parsing `2 0 obj` it will find `/Type /Pages` instead of `/Type /Catalog` (which it should). Hence, the parser could deem the trailer invalid + continue searching for a "better" trailer, or it could give up and emit an error message. -- In any case, your constructed example doesn't really stretch Adobe's tolerance, since it's only an empty page without content stream. – Kurt Pfeifle Jun 26 '11 at 21:37
  • You are right, but like i said with the 3rd example, the entire tree parses correctly, but shouldn't – Rahly Jun 30 '11 at 02:41

Q: Doc, it hurts when I do this.
A: Don't do that.

The correct way to parse the end of a PDF goes something like this:

  1. Find the last startxref
  2. Back up to that byte offset and start parsing xref table entries
  3. After the last xref table, parse out the trailer.

You don't really have to parse out the object numbers and byte offsets and so forth if you're just trying to find the trailer. All you need to do is look to see how many entries are in a given subsection of the xref, skip 20*N bytes, and check for another subsection (or "trailer"). When you finally hit "trailer" instead of numbers, you're there.
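The skip-ahead described above can be sketched in Python like this. The names `trailer_after_xref`, `SUBSECTION` and `WHITESPACE` are made up for the illustration, and the sketch assumes a classic xref table with exact 20-byte entries that is immediately followed by the trailer.

```python
import re

# Subsection header line: "first_object_number entry_count".
SUBSECTION = re.compile(rb"\s*(\d+)[ \t]+(\d+)[ \t]*(?:\r\n|\r|\n)")
WHITESPACE = re.compile(rb"[\x00\t\n\x0c\r ]*")


def trailer_after_xref(data: bytes, xref_offset: int) -> int:
    """Return the offset of the `trailer` keyword that follows an xref table."""
    if not data.startswith(b"xref", xref_offset):
        raise ValueError("startxref does not point at an xref keyword")
    pos = xref_offset + len(b"xref")
    while True:
        header = SUBSECTION.match(data, pos)
        if header is None:
            break  # no further subsection header, so the entries are done
        # Skip N entries of exactly 20 bytes each without parsing them.
        pos = header.end() + 20 * int(header.group(2))
    pos = WHITESPACE.match(data, pos).end()
    if not data.startswith(b"trailer", pos):
        raise ValueError("no trailer keyword directly after the xref table")
    return pos


entries = b"0000000000 65535 f \n" + b"0000000009 00000 n \n"
pdf = b"%PDF-1.3\nxref\n0 2\n" + entries + b"trailer\n<< /Size 2 >>\n"
trailer_pos = trailer_after_xref(pdf, pdf.index(b"xref"))
```

On a file whose xref lines are shorter than 20 bytes, or whose trailer does not directly follow the table, the sketch raises an error rather than guessing.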

So why on Earth do you just want the trailer?


When I went hunting through the PDF Reference, I expected to find some line of text stating that the header/body/xref/trailer had to be in that order. I did not.

What I DID find, was this:

A basic conforming PDF file shall be constructed of following four elements (see Figure 2):
- A one-line header...
- A body...
- A cross-reference table...
- A trailer...

There are bullets in front of these sections, not numbers.

So that hints that a conforming PDF can get away with swapping the order of the body and xref. On the other hand, the header is required to be first, the trailer is required to be last, and all the sections of a PDF are listed in that order. This implies order, but won't hold up in court.

But if you look at Figure 2 (of chapter 7, section 5.1), entitled "Initial Structure of a PDF file", you'll see the order defined visually. That's a tad thin, but I'll cling to it anyway.

I wouldn't be at all surprised to find that a PDF that put its body after the xref table broke some PDF viewers (particularly a malformed PDF where the program tried to fix it).

I've been working with PDF files for well over a decade. In all that time, I have never seen a PDF where the xref came before the body. And I've seen some REALLY screwed up PDFs.

So while my "correct way to parse a PDF" may not be Iron Clad, it's still pretty durable.


And if you absolutely insist on backing up to find the keyword "trailer", then you can look for "close an array or dictionary" tokens after you parse out the trailer you found. If it were wrapped in a string, all the name slashes would have to be escaped, leading to Bad Parsing. You can't have spaces in a Name... so that leaves just array and dictionary.

But the odds of you ever encountering this problem in Real Life are astronomically small, unless you set out to break PDF software and create these PDFs yourself. That would bring your motives into question.

Mark Storer
  • You don't seem to really understand, but thats ok. You can't parse the trailer after the `xref`, because it might not appear after the `xref`, normal objects can. See my example. Notice the `xref` at the beginning of the file? If the xref were at the end of the file, with the trailer, there would be no need for `startxref`. I don't want JUST the trailer. I want to know how to parse the trailer CORRECTLY. – Rahly Jun 29 '11 at 05:03
  • Right, but the problem here is that according to the spec they do not have to be in that order. Also, name slashes do NOT have to be escaped, only backslashes. The escapes are \n, \r, \t, \b, \f, unbalanced parentheses \( and \), \\ and \ddd where ddd is octal; there is no escape for the forward slash (pg 54 of the standard). In fact, according to the standard, \/ would be invalid in a string. And I'm sorry that I'd rather do what's correct than go with what "works". – Rahly Jun 30 '11 at 02:35
  • My issue is that, yes the trailer is last, BUT since it allows comments in it, that it is almost completely impossible to find it, based on the spec. Because it allows not only comments but strings to be contained within a trailer/dictionary. You have to read backwards to find it, but tokenizing backwards is almost impossible. – Rahly Jun 30 '11 at 02:44
  • Then I'll add a question: if Adobe (makers of the spec) parse it 100% correctly according to the tokenized aspect of the spec, do I not have to, since 99.999% of PDFs conform to a more rigid spec than the spec? – Rahly Jun 30 '11 at 02:48
  • It's actually HELPFUL to write your objects first, and then your xref. That way you don't have to serialize everything in memory to determine the byte offsets you need to build the xref in the first place. You're making this Much Harder than it needs to be. – Mark Storer Jun 30 '11 at 21:51
  • I'm processing PDFs, not creating them. And handling most people's IMPLEMENTATION of the standard is not the same as handling the actual standard. The standard says that this CAN happen, by giving the format they've described. Saying I can't parse something because it doesn't conform to the standard, is acceptable to me. But having a file that does in fact conform to the standard, but I can't read it because of MY bad implementation, is not. – Rahly Jun 30 '11 at 22:49
  • Then I guess you're just screwed, ain'cha? – Mark Storer Jul 01 '11 at 00:06
  • So you don't actually know the answer, and respond with derogatory remarks? Nice. If you don't know something, you should just either say so, or not respond at all. At least pipitas, is giving me examples of possible ways to parse it, even if i am able to create examples that follow the spec, but fail his methods. If your answer is the 100% correct way to do it, then show me how my example fails the specification. "No one does it that way" or "I've never seen it done that way" is not an acceptable answer, you've never seen every PDF out there. – Rahly Jul 01 '11 at 01:35
  • **You've made a key mistake.** You're confusing "What Acrobat will accept" with "what the spec says". Adobe has worked hard so that their programs will accept just about any hunk-of-broken-crap PDF that comes down the pike. I assert (based on the aforementioned Figure 2) that your PDF is one of those hunks of broken crap. – Mark Storer Jul 05 '11 at 16:19
  • No, I'm basing it off of what the spec says it can handle with what adobe does handle, and how it handles it. The fact that I can write a trailer that conforms 100% to the spec, but only Acrobat can read it, doesn't really make sense. – Rahly Jul 05 '11 at 19:56
  • Note: Adobe handles trailers moved inside the body; that doesn't conform to the spec, so I'm not worrying about it. But comments/strings in a trailer are supported by the spec, and reverse parsing screws this up. I figured out how Adobe gets around this: by always parsing forward, except for the last "comment" and startxref keywords. – Rahly Jul 05 '11 at 20:06
  • They may reverse-parse out lines of text. Come to think of it, I'm not so sure PS comments are supported in Arbitrary Locations. Lemme check... And they are. So you should be able to parse out the lines, sans comments, and go from there. – Mark Storer Jul 08 '11 at 23:52
  • except parsing backwards you don't know if ) is the end of a string, or its in a comment, same with a %, you don't know if its a % in a string, in which case, its not a comment, but a character in the string. If strings were not multilined, this probably wouldn't be an issue. Parsing forward, doesn't have this issue. They actually forward parse, based on the last parsable object location. – Rahly Jul 09 '11 at 01:25

Jeremy has repeatedly edited his question and example code. This made my original answer and some of my original comments partially invalid and missing the point.

It is a fact (and a well-known one amongst people in the prepress trade and industry) that Adobe, in quite a few instances, silently and without warning processes and displays PDF files which do not pass a strict validity checker.

Jeremy seems to have constructed such a case. His latest example would make any PDF parser interpret the following snippet as being the trailer (I stripped comments):

trailer<<
  /Size 4
  /Root 2 0 R
  /Info 3 0 R
>>

However, taking the info in this trailer will lead to the parser looking for the /Root at object 2 (while object 2 in fact is of /Type /Pages when it should be of /Type /Catalog for being the root object).

As a consequence, the PDF interpreter would have to

  • (a) either continue searching for another instance of a trailer on the chance that the next one does contain legitimate PDF info,
  • (b) or give up on processing the file and throw an error.

Adobe seems to follow alternative (a).

Ghostscript seems to follow alternative (b).
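The check behind alternative (a) can be sketched in Python as follows. This is a simplified illustration, not a description of what Acrobat or Ghostscript actually do internally: trailers and objects are plain dicts, `/Root` references are reduced to object numbers, and the name `pick_trailer` is hypothetical.

```python
from typing import Dict, List, Optional


def pick_trailer(candidates: List[dict], objects: Dict[int, dict]) -> Optional[dict]:
    """Return the first candidate trailer whose /Root is a /Type /Catalog object.

    `candidates` is ordered as a backward search would meet them
    (last in the file first). Returning None corresponds to alternative (b).
    """
    for trailer in candidates:
        root = objects.get(trailer.get("Root"))
        if isinstance(root, dict) and root.get("Type") == "Catalog":
            return trailer
    return None


# Objects 1-4 of Jeremy's four-object example, reduced to their /Type.
objects = {
    1: {"Type": "Catalog"},
    2: {"Type": "Pages"},
    3: {"Type": "Page"},
    4: {},
}
candidates = [
    {"Size": 4, "Root": 2, "Info": 3},  # the decoy inside the string, met first
    {"Size": 4, "Root": 1, "Info": 4},  # the real trailer
]
chosen = pick_trailer(candidates, objects)
```

This check rejects the decoy here only because object 2 is a /Pages node. In Jeremy's later eight-object example the decoy's /Root (object 5) is a well-formed catalog, so this test alone would not tell the two trailers apart.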


Note, that according to my byte-counting, Jeremy's PDF example has one more problem: its xref-table is invalid. It has only 16 bytes per line instead of 20. From the PDF spec document:

[....] the cross-reference entries themselves, one per line. Each entry shall be exactly 20 bytes long, including the end-of-line marker. There are two kinds of cross-reference entries: one for objects that are in use and another for objects that have been deleted and therefore are free. Both types of entries have similar basic formats, distinguished by the keyword n (for an in-use entry) or f (for a free entry). The format of an in-use entry shall be:

nnnnnnnnnn ggggg n eol

where:

nnnnnnnnnn shall be a 10-digit byte offset in the decoded stream
ggggg shall be a 5-digit generation number
n shall be a keyword identifying this as an in-use entry
eol shall be a 2-character end-of-line sequence

The byte offset in the decoded stream shall be a 10-digit number, padded with leading zeros if necessary, giving the number of bytes from the beginning of the file to the beginning of the object.

So to make Jeremy's xref table a valid one, each offset should be padded with two more leading zeros, and the table should read:

xref
0 4
0000000000 65535 f 
0000000110 00000 n 
0000000250 00000 n 
0000000315 00000 n 
0000000576 00000 n 

However, adding these two zeros to each xref line also shifts each object by 10 more bytes, so the nnnnnnnnnn figures should also be corrected (being lazy, I didn't do it).
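The 20-byte rule and the correction just described can be sketched in Python like this. The helper names `is_valid_entry` and `repad_entry` are mine. The shift of 10 assumes the xref table sits before all objects (as in Jeremy's file) and that the original offsets were otherwise right, which I have not verified.

```python
import re

# 10-digit offset, space, 5-digit generation, space, n or f, 2-byte end of line.
ENTRY = re.compile(rb"\d{10} \d{5} [nf](?: \r| \n|\r\n)")


def is_valid_entry(entry: bytes) -> bool:
    """True if `entry` is one well-formed 20-byte cross-reference entry."""
    return ENTRY.fullmatch(entry) is not None


def repad_entry(short_line: str, shift: int) -> bytes:
    """Rebuild a too-short entry line, moving in-use offsets by `shift` bytes."""
    first, generation, kind = short_line.split()
    value = int(first)
    if kind == "n":  # only in-use entries hold a byte offset
        value += shift
    return f"{value:010d} {int(generation):05d} {kind} \n".encode("ascii")


# Five lines grow by 2 bytes each, so the objects after the table move by 10.
fixed = [repad_entry(line, 10) for line in (
    "00000000 65535 f",
    "00000110 00000 n",
    "00000250 00000 n",
    "00000315 00000 n",
    "00000576 00000 n",
)]
```

The free entry keeps its value because its first field is the next free object number, not a byte offset.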

So Acrobat did open Jeremy's constructed file (without any warning)

  • (1) despite the invalid trailer definition, and
  • (2) despite the glaringly non-compliant xref table.

This adds two more proofs to what I stated in my second paragraph: Adobe's PDF parsing accepts files which violate Adobe's own PDF standard.

This is unfortunate. It lets lazy developers get away, without punishment, with writing sloppy code which emits non-compliant PDF files. The fact that Adobe doesn't outright reject such crappy files may be in the interest of "user friendliness", but it promotes violations of the standard. At the very least, Adobe should always issue warnings when encountering such stuff.

Since Jeremy seems to be writing a PDF parser that wants to cover all corner cases, his users should hope that it at least warns them when it encounters shitty PDFs.

In any case: I've seen a lot of non-compliant PDF files emitted by crappy PDF generators. But so far I have never encountered one which had comments sprinkled into its trailer section. So trying to cover corner cases should possibly start with lower-hanging fruit than this.

Kurt Pfeifle
  • So I'm guessing what you are saying, is that you have to test the entire document tree and if that fails, continue parsing for the trailer. I've done some further testing. By placing a "fake" document structure within the first but never referenced, except by the trailer in the string, Adobe STILL finds the correct one. What I did was create two documents within a pdf, but the second one is never referenced by the trailer, each page had a different size. fake has 100x100 page size. And it still parsed the correct trailer. – Rahly Jun 29 '11 at 05:12
  • Pure speculation, but let's ask: why would the company that is the 'god' of PDF not only go to lengths to process invalid PDF files, but do it silently? Well... what if some Adobe products produced such invalid files - would that be a sufficient reason? – Spike0xff May 25 '16 at 17:27

I think I have found the solution. After extensive testing with Adobe, I have found that what Adobe does is find the last known construct that can be parsed, and work forward from there. Then it finds the last trailer that can be parsed correctly. So even if there is a correct root node in a trailer before the last valid trailer that can be parsed, if the root in the last trailer is invalid, it'll still fail.

It is also worth noting that this is still token-based parsing, going forward: trailers between () are ignored, and so are trailers between stream/endstream, unless that stream has an invalid length or a length specified in an obj after the stream (as these objects are not specified in the xref table).

Adobe seems to take it an extra step further by actually finding trailers in "gaps" in the xref table as well. This doesn't conform to the current spec model, as the trailer is found at the end, and not in the body or xref table.

So what I think is the best model is to get the largest object offset in the xref table and the location of the xref table itself; if the xref table is after the largest object offset, then use that, and work forward from there. This will allow me to correctly parse strings and comments without worrying. Thanks for everyone's help in this matter. Hopefully this helps people build a more robust PDF parser as well.
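A minimal Python sketch of that forward, token-aware scan is shown below. It is my own illustration, not Adobe's code: it assumes `start` is a known token boundary that lies after the last stream, so stream/endstream handling is left out. It also does not check that the keyword is delimited on both sides, and the name `find_trailers` is hypothetical.

```python
from typing import List


def find_trailers(data: bytes, start: int = 0) -> List[int]:
    """Offsets of `trailer` keywords lying outside comments and strings."""
    found = []
    i, n = start, len(data)
    while i < n:
        c = data[i:i + 1]
        if c == b"%":
            # Comment: runs to the end of the line.
            while i < n and data[i:i + 1] not in (b"\r", b"\n"):
                i += 1
        elif c == b"(":
            # Literal string: balanced parentheses, backslash escapes one byte.
            depth = 1
            i += 1
            while i < n and depth > 0:
                ch = data[i:i + 1]
                if ch == b"\\":
                    i += 1  # skip the escaped byte as well
                elif ch == b"(":
                    depth += 1
                elif ch == b")":
                    depth -= 1
                i += 1
        elif data.startswith(b"<<", i):
            i += 2  # dictionary opener, not a hex string
        elif c == b"<":
            # Hex string: runs to the closing ">".
            end = data.find(b">", i)
            i = n if end < 0 else end + 1
        elif data.startswith(b"trailer", i):
            found.append(i)
            i += len(b"trailer")
        else:
            i += 1
    return found


sample = (
    b"trailer<< /Root 1 0 R /Key (\\\n"
    b"trailer<< /Root 2 0 R >>%)\n"
    b">>%)\nstartxref\n 15\n%%EOF\n"
)
hits = find_trailers(sample)
```

Because the scan always knows whether it is inside a string or a comment, the `trailer` inside `/Key (...)` is never reported, whereas a backward byte search would hit it first.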

Rahly

The trailer dictionary follows the xref section. Based on the startxref value, you jump to the beginning of xref section. After you read the xref section, you will reach the trailer dictionary. The trailer keyword is always the first on its line (white spaces are allowed in front of it). PDF files allow incremental updates, so you can encounter PDF files with multiple xref sections and trailers, but the processing rule is the same, first process the xref section and then the trailer. If the file includes incremental updates, the trailer section will include a reference to the previous xref section.
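As a rough Python sketch of following incremental updates: assume each xref section has already been parsed and its trailer stored in a dict keyed by the section's byte offset. That structure and the name `xref_chain` are assumptions for illustration only. The `/Prev` entry of each trailer then gives the next section to read.

```python
from typing import Dict, List


def xref_chain(start: int, trailers: Dict[int, dict]) -> List[int]:
    """Offsets of the xref sections in reading order, newest first."""
    order: List[int] = []
    seen = set()
    offset = start
    while offset is not None and offset not in seen:  # guard against /Prev loops
        if offset not in trailers:
            raise ValueError(f"no xref section parsed at offset {offset}")
        seen.add(offset)
        order.append(offset)
        offset = trailers[offset].get("Prev")
    return order


# Hypothetical file with one incremental update: the newest section at 900
# points back to the original section at 400.
trailers = {
    900: {"Size": 8, "Root": 1, "Prev": 400},
    400: {"Size": 5, "Root": 1},
}
chain = xref_chain(900, trailers)
```

The starting offset is the value read after the last startxref keyword.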

iPDFdev
  • but according to the spec, the xref can appear at the beginning of the file after %PDF-1.?, then filled with objects, then the trailer, startxref, and %%EOF. Creating a test file with this seems to render in Adobe without errors. – Rahly Jun 24 '11 at 08:58
  • also, the spec says that the trailer precedes the startxref, not succeeds the xref. So it doesn't matter where the xref is in the file, the trailer dictionary is still before the startxref – Rahly Jun 24 '11 at 09:13
  • @Jeremy, what displays successfully in Adobe Reader is not a good indicator of what is technically correct according to the PDF specification. Adobe is very lenient. If it encounters a malformed PDF it will attempt to repair it. If it encounters a PDF that does not conform 100% to the PDF specification, it will attempt to display it regardless. So... don't try to match it up to the PDF spec. – Rowan Jun 24 '11 at 11:19
  • Yes, you are correct regarding the spec. But practically all the PDF files I've seen are constructed this way: xref section, trailer, startxref. This can be used to optimize the parsing. Otherwise you read the trailer keyword backwards and when you find it, you keep reading till the beginning of the line. If you find only whitespaces, you have a valid trailer keyword. – iPDFdev Jun 24 '11 at 11:26
  • Right, but I'm reading PDFs, not constructing them. I don't want to parse "easiest for me", if that's not going to process all valid PDFs according to the spec. – Rahly Jun 24 '11 at 14:32
  • @Jeremy Walton: You are right: the Adobe specs are not always unequivocal and without potential for misunderstanding. But you are wrong: their description of the trailer section is quite clear. Backed up by the generalized example for the "overall structure". "preceding" here does not mean "anywhere in the file before the startxref" but means "directly before the startxref". – Kurt Pfeifle Jun 24 '11 at 17:10
  • @Jeremy Walton: You strive to create a PDF parser that even digests corner case files without hiccups. Good. But in that case don't rely on the `trailer` keyword only, but do also check if the trailer contains a valid dictionary using valid key names. – Kurt Pfeifle Jun 24 '11 at 17:12
  • I would rather parse backwards by tokens, but allowing comments really screws with this. – Rahly Jun 30 '11 at 02:38