When I analyse multiple PDF files with a hex editor, I see that all of them have two trailers. That's possible if an object has been changed or renewed (https://blog.idrsolutions.com/multiple-trailers-in-a-pdf-file/), but in my case the PDF files have not been edited. Does anyone know why all of the analysed files have two trailers?

This is a PDF file that contains a lot of text and also two images (it has two trailers, which are almost identical to each other):

0001a30bh: 74 72 61 69 6C 65 72 0D 0A 3C 3C 2F 53 69 7A 65 ; trailer..<</Size
0001a31bh: 20 34 37 2F 52 6F 6F 74 20 31 20 30 20 52 2F 49 ;  47/Root 1 0 R/I
0001a32bh: 6E 66 6F 20 31 35 20 30 20 52 2F 49 44 5B 3C 45 ; nfo 15 0 R/ID[<E
0001a33bh: 42 33 46 46 33 41 31 45 33 37 33 43 36 34 45 39 ; B3FF3A1E373C64E9
0001a34bh: 31 30 45 33 46 42 43 34 45 37 38 39 31 33 43 3E ; 10E3FBC4E78913C>
0001a35bh: 3C 45 42 33 46 46 33 41 31 45 33 37 33 43 36 34 ; <EB3FF3A1E373C64
0001a36bh: 45 39 31 30 45 33 46 42 43 34 45 37 38 39 31 33 ; E910E3FBC4E78913
0001a37bh: 43 3E 5D 20 3E 3E 0D 0A 73 74 61 72 74 78 72 65 ; C>] >>..startxre
0001a38bh: 66 0D 0A 31 30 36 33 32 33 0D 0A 25 25 45 4F 46 ; f..106323..%%EOF
0001a39bh: 0D 0A 78 72 65 66 0D 0A 30 20 30 0D 0A 74 72 61 ; ..xref..0 0..tra
0001a3abh: 69 6C 65 72 0D 0A 3C 3C 2F 53 69 7A 65 20 34 37 ; iler..<</Size 47
0001a3bbh: 2F 52 6F 6F 74 20 31 20 30 20 52 2F 49 6E 66 6F ; /Root 1 0 R/Info
0001a3cbh: 20 31 35 20 30 20 52 2F 49 44 5B 3C 45 42 33 46 ;  15 0 R/ID[<EB3F
0001a3dbh: 46 33 41 31 45 33 37 33 43 36 34 45 39 31 30 45 ; F3A1E373C64E910E
0001a3ebh: 33 46 42 43 34 45 37 38 39 31 33 43 3E 3C 45 42 ; 3FBC4E78913C><EB
0001a3fbh: 33 46 46 33 41 31 45 33 37 33 43 36 34 45 39 31 ; 3FF3A1E373C64E91
0001a40bh: 30 45 33 46 42 43 34 45 37 38 39 31 33 43 3E 5D ; 0E3FBC4E78913C>]
0001a41bh: 20 2F 50 72 65 76 20 31 30 36 33 32 33 2F 58 52 ;  /Prev 106323/XR
0001a42bh: 65 66 53 74 6D 20 31 30 35 39 37 32 3E 3E 0D 0A ; efStm 105972>>..
0001a43bh: 73 74 61 72 74 78 72 65 66 0D 0A 31 30 37 34 32 ; startxref..10742
0001a44bh: 31 0D 0A 25 25 45 4F 46                         ; 1..%%EOF

This is a PDF file that contains only some random characters:

000071cbh: 74 72 61 69 6C 65 72 0D 0A 3C 3C 2F 53 69 7A 65 ; trailer..<</Size
000071dbh: 20 32 33 2F 52 6F 6F 74 20 31 20 30 20 52 2F 49 ;  23/Root 1 0 R/I
000071ebh: 6E 66 6F 20 39 20 30 20 52 2F 49 44 5B 3C 39 46 ; nfo 9 0 R/ID[<9F
000071fbh: 46 31 32 45 31 43 30 41 35 36 44 42 34 38 41 33 ; F12E1C0A56DB48A3
0000720bh: 41 31 43 37 32 30 33 38 32 33 30 32 45 32 3E 3C ; A1C720382302E2><
0000721bh: 39 46 46 31 32 45 31 43 30 41 35 36 44 42 34 38 ; 9FF12E1C0A56DB48
0000722bh: 41 33 41 31 43 37 32 30 33 38 32 33 30 32 45 32 ; A3A1C720382302E2
0000723bh: 3E 5D 20 3E 3E 0D 0A 73 74 61 72 74 78 72 65 66 ; >] >>..startxref
0000724bh: 0D 0A 32 38 36 35 39 0D 0A 25 25 45 4F 46 0D 0A ; ..28659..%%EOF..
0000725bh: 78 72 65 66 0D 0A 30 20 30 0D 0A 74 72 61 69 6C ; xref..0 0..trail
0000726bh: 65 72 0D 0A 3C 3C 2F 53 69 7A 65 20 32 33 2F 52 ; er..<</Size 23/R
0000727bh: 6F 6F 74 20 31 20 30 20 52 2F 49 6E 66 6F 20 39 ; oot 1 0 R/Info 9
0000728bh: 20 30 20 52 2F 49 44 5B 3C 39 46 46 31 32 45 31 ;  0 R/ID[<9FF12E1
0000729bh: 43 30 41 35 36 44 42 34 38 41 33 41 31 43 37 32 ; C0A56DB48A3A1C72
000072abh: 30 33 38 32 33 30 32 45 32 3E 3C 39 46 46 31 32 ; 0382302E2><9FF12
000072bbh: 45 31 43 30 41 35 36 44 42 34 38 41 33 41 31 43 ; E1C0A56DB48A3A1C
000072cbh: 37 32 30 33 38 32 33 30 32 45 32 3E 5D 20 2F 50 ; 720382302E2>] /P
000072dbh: 72 65 76 20 32 38 36 35 39 2F 58 52 65 66 53 74 ; rev 28659/XRefSt
000072ebh: 6D 20 32 38 33 37 34 3E 3E 0D 0A 73 74 61 72 74 ; m 28374>>..start
000072fbh: 78 72 65 66 0D 0A 32 39 32 37 35 0D 0A 25 25 45 ; xref..29275..%%E
0000730bh: 4F 46                                           ; OF

                                                                                      
Moooz

1 Answer

Those files are most likely created by MS Word. The excerpts you posted look like their interpretation of hybrid reference PDFs.

There are two special constructs in which the PDF specification uses the mechanisms it introduced for incremental updates for something else:

  • Linearized PDFs (see ISO 32000-2:2020 Annex F) and
  • Hybrid-reference PDFs (see ISO 32000-2:2020 Section 7.5.8.4).

Your excerpts look like the latter type of PDFs.

Some background:

With PDF 1.5 Adobe introduced the option to collect multiple non-stream indirect objects in a stream, a so-called "object stream". The advantage of doing so is that data in streams can be compressed, while those objects cannot be compressed otherwise. At the same time they also introduced the option to put the cross reference table data into streams, the so-called "cross-reference streams", also to allow compression.

Obviously a new type of cross reference entry was necessary to describe indirect objects in object streams, so they defined entries of that kind, but only for the cross-reference streams, not for the old cross reference tables.

PDFs stored using object and cross-reference streams often indeed are much smaller than the same PDFs stored as regular indirect objects with cross reference tables. On the other hand PDF processors that were not aware of these techniques couldn't open these PDFs at all.

Thus, Adobe came up with the idea of hybrid files: Files that contain the basic objects in a PDF required to view it at all in the old-fashioned way and the objects for newer or optional features in object and cross reference streams. The trailers of the cross reference tables contain an entry XRefStm pointing to the cross reference stream.

For some reason, though, it was specified that object lookup first had to be attempted in the cross reference table, and only if no entry was found there for the object number in question, the associated cross reference stream was to be searched.

As the first cross reference table is required to cover the complete range of object numbers used, this lookup strategy implied that hybrid-reference files needed a second cross reference table whose trailer could point to the cross reference stream that would be used for lookups before the innermost, first cross reference table.
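The lookup order described above can be illustrated with a toy model. This is only a sketch, not a parser: the names `lookup`, `update_section` and `base_section`, the tuple layout of the entries, and the object numbers are all made up for illustration, and the model assumes the first table lists the stream-only object with a placeholder (free) entry.

```python
def lookup(obj_num, sections):
    """Resolve an object number the way the answer describes it.

    `sections` is ordered from the newest cross reference section to the
    oldest one (following /Prev). In each section the classic table is
    consulted first, then the stream its trailer names via /XRefStm.
    """
    for section in sections:
        entry = section["table"].get(obj_num)
        if entry is not None:
            return entry
        entry = section.get("xref_stream", {}).get(obj_num)
        if entry is not None:
            return entry
    return None


# Hypothetical second section: an empty table ("0 0") whose trailer
# carries /XRefStm, so only the stream contributes entries.
update_section = {
    "table": {},
    "xref_stream": {20: ("compressed", 18, 0)},
}

# Hypothetical first section: a table covering the whole object range,
# with the stream-only object 20 assumed to be listed as free.
base_section = {
    "table": {1: ("in_use", 17), 20: ("free",)},
}

print(lookup(20, [update_section, base_section]))
print(lookup(20, [base_section]))
```

In this model the second, empty table is what lets the cross reference stream be consulted before the first table; with only `base_section`, object 20 would resolve to the placeholder entry.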

This is what we see in your example:

trailer
<</Size 47/Root 1 0 R/Info 15 0 R/ID[<EB3FF3A1E373C64E910E3FBC4E78913C><EB3FF3A1E373C64E910E3FBC4E78913C>] >>
startxref
106323
%%EOF
xref
0 0
trailer
<</Size 47/Root 1 0 R/Info 15 0 R/ID[<EB3FF3A1E373C64E910E3FBC4E78913C><EB3FF3A1E373C64E910E3FBC4E78913C>] /Prev 106323/XRefStm 105972>>
startxref
107421
%%EOF

Actually most PDF producers implemented hybrid-reference files (if they did at all) under the impression that the cross reference stream and probably also the object streams should go between the first trailer and the second cross reference table. But there is no requirement for that, and the PDF export of MS Office chose to put all the streams before the first cross reference table. As that's the case for your examples, too, I assume they were produced by MS Office.
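As a rough way to spot this pattern in code, one could scan the raw bytes for classic trailers and check whether any of them carries an XRefStm entry. The following is a minimal Python sketch under simplifying assumptions: the function names `summarize_trailers` and `is_hybrid` are made up here, and the regular expression only handles flat trailer dictionaries (no nested `<< >>`), as in the excerpts from the question.

```python
import re

# Matches "trailer << ... >> startxref N" for flat trailer dictionaries.
TRAILER_RE = re.compile(
    rb"trailer\s*<<(.*?)>>\s*startxref\s+(\d+)", re.DOTALL
)


def _int_entry(body, key):
    """Return the integer value of /key in the dictionary body, or None."""
    match = re.search(rb"/" + key + rb"\s+(\d+)", body)
    return int(match.group(1)) if match else None


def summarize_trailers(data):
    """List Prev, XRefStm and startxref for every classic trailer found."""
    return [
        {
            "prev": _int_entry(m.group(1), b"Prev"),
            "xref_stm": _int_entry(m.group(1), b"XRefStm"),
            "startxref": int(m.group(2)),
        }
        for m in TRAILER_RE.finditer(data)
    ]


def is_hybrid(data):
    """True if any classic trailer points to a cross reference stream."""
    return any(t["xref_stm"] is not None for t in summarize_trailers(data))


# The end of the first file from the question, rebuilt from its hex dump.
ID = b"<EB3FF3A1E373C64E910E3FBC4E78913C>"
EXCERPT = (
    b"trailer\r\n<</Size 47/Root 1 0 R/Info 15 0 R/ID[" + ID + ID + b"] >>\r\n"
    b"startxref\r\n106323\r\n%%EOF\r\n"
    b"xref\r\n0 0\r\n"
    b"trailer\r\n<</Size 47/Root 1 0 R/Info 15 0 R/ID[" + ID + ID + b"]"
    b" /Prev 106323/XRefStm 105972>>\r\n"
    b"startxref\r\n107421\r\n%%EOF"
)

for trailer in summarize_trailers(EXCERPT):
    print(trailer)
print("hybrid:", is_hybrid(EXCERPT))
```

A byte scan like this is only a heuristic; a real tool would follow startxref and Prev instead of searching for keywords.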

mkl
  • Hi mkl, thank you for your extensive answer, and the files were indeed created from MS Word. I've one more question (but maybe I'm not allowed to ask multiple questions in one post, if that's the case my apologies). Both of the PDF files are contentwise completely different, but the length of the trailer is the same. Is it always the case that PDF files have, regardless of their size/content, the same trailer length? For example, as far as I know, it's possible that PDF versions older than 2.0 don't have a GUID, a trailer without a unique ID is shorter than a trailer that does have a GUID..? – Moooz Nov 10 '22 at 23:49
  • Indeed, as @KJ mentions, already different numbers of digits in the values of the Trailer entries will result in different lengths. Furthermore, **Info** is optional, the size of the **ID** strings may differ, sometimes there are additional trailer entries, there may be additional white spaces, and there may be comments. To get a feel for what trailers look like, you should look at pdfs by different producers, not only pdfs exported by MS Office. – mkl Nov 11 '22 at 04:44
  • @Moooz Furthermore... *"Both of the PDF files are contentwise completely different, but the length of the trailer is the same."* - if you look closely, you'll notice that the lengths are not the same, merely close to each other. – mkl Nov 11 '22 at 09:34
  • @mkl Thanks, that makes sense :) The reason I was confused is because this paper (https://link.springer.com/content/pdf/10.1007/s00500-018-3257-z.pdf) uses a fixed length to analyse the trailer part of a PDF file (but I think that's because they want to distinguish between normally and maliciously encrypted files). – Moooz Nov 11 '22 at 21:45
  • Unfortunately I don't have access to that paper. But assuming a constant trailer length is simply wrong. – mkl Nov 12 '22 at 07:12
  • Hi mkl, one more question (I think you are a PDF specialist, considering your answers and profile, so I hope you can help me with this). I have studied/tried a few things, and one of the interesting things is that I've created a PDF file in Adobe, by converting a .txt to a .pdf. When I open this file in HEX, I see that it has no trailer. I have looked at it on the internet, but can't find much information about it. It looks like the trailer is embedded/encoded in the "stream". Could you please give your opinion on this? The thing is, I would like to find the trailer of any pdf file, with code – Moooz Nov 20 '22 at 23:59
  • @Moooz *"It looks like the trailer is embedded/encoded in the "stream". ... I would like to find the trailer of any pdf file, with code"* - Most likely you have a PDF with a cross reference stream instead of a cross reference table. In that case the trailer information is added to the cross reference stream dictionary. This option is available since pdf 1.5. It is also well-described in the specification. – mkl Nov 21 '22 at 07:36
  • @Moooz To find the trailer information, the specification expects you to do the following: 1. Find the **startxref** line near the end of the pdf. 2. Read the number from the next line. 3. Go to the position in the file given by that number. 4. If you find **xref** there, read the cross reference table; the trailer will be right behind it. 5. If instead you find an indirect object (starting with some *NNN M* **obj**), it is a cross reference stream; the trailer information is in its stream dictionary. – mkl Nov 21 '22 at 07:42
  • Thank you mkl, I will have a look at the PDF 1.5 specs, 1172 pages, so it will take some time ;) but at least I have more background information now, thanks! – Moooz Nov 21 '22 at 21:07
  • *"I will have a look at the PDF 1.5 specs"* - You had better look at ISO 32000, preferably part 2. The old Adobe PDF references were regarded as non-normative in nature. – mkl Nov 21 '22 at 21:26
  • Okay I will look at the ISO norm, one more thing mkl.. (I know that I have to look at this myself, and I will.. but just to know...), if the trailer is embedded in the cross reference stream, it doesn't contain the word "trailer": "the keywords xref and trailer are no longer used", is it in that case still possible to find the trailer (since the trailer is "hidden" in the cross reference stream)? or maybe I should formulate my question a bit differently.. I have to find the trailer without a manual action, so with some code I want to find the trailer of arbitrary PDF files. – Moooz Nov 21 '22 at 21:35
  • I mean I think (but please correct me if I'm wrong) that it's quite trivial to find the "startxref" section, but the trailer dictionary, whose entries are stored in the stream dictionary, is more difficult to differentiate – Moooz Nov 21 '22 at 23:01
  • *"is more difficult to differentiate"* - well, it is not really difficult. You take the number after **startxref**, go to that position in the file, and if there is an indirect object there, it's the stream in question. – mkl Nov 22 '22 at 09:56
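
The lookup steps mkl lists in the comments can be sketched in Python roughly as follows. This is a minimal illustration, not a full parser: the function name `locate_trailer`, the 2048-byte tail window, and the two tiny sample byte strings are assumptions made for this example, and the sketch neither reads the table entries nor decodes the stream.

```python
import re


def locate_trailer(data):
    """Return ("table", offset) or ("stream", offset) for a PDF's bytes."""
    # Steps 1 and 2: find the last startxref near the end of the file
    # and read the number that follows it.
    tail = data[max(0, len(data) - 2048):]
    matches = list(re.finditer(rb"startxref\s+(\d+)", tail))
    if not matches:
        raise ValueError("no startxref found near the end of the file")
    offset = int(matches[-1].group(1))

    # Step 3: go to that position in the file.
    chunk = data[offset:offset + 32]

    # Step 4: a classic table starts with "xref"; the trailer keyword
    # and its dictionary come right after the table.
    if chunk.startswith(b"xref"):
        return ("table", offset)

    # Step 5: an indirect object ("NNN M obj") here is a cross reference
    # stream; the trailer entries live in its stream dictionary.
    if re.match(rb"\d+\s+\d+\s+obj", chunk):
        return ("stream", offset)

    raise ValueError("startxref does not point to a table or an object")


# Two made-up minimal byte strings; in both, offset 9 follows the header.
SAMPLE_TABLE = (
    b"%PDF-1.4\n"
    b"xref\n0 0\ntrailer\n<</Size 1>>\nstartxref\n9\n%%EOF\n"
)
SAMPLE_STREAM = (
    b"%PDF-1.5\n"
    b"5 0 obj\n<</Type/XRef>>\nstream\nendstream\nendobj\n"
    b"startxref\n9\n%%EOF\n"
)

print(locate_trailer(SAMPLE_TABLE))
print(locate_trailer(SAMPLE_STREAM))
```

For a real file one would pass the contents read in binary mode, for example `locate_trailer(open(path, "rb").read())`, and then parse either the dictionary after the trailer keyword or the stream dictionary at the returned offset.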