I am processing files and use magic numbers to identify file type validity.

I am using the Medsea mime-util JAR for Java to inspect the magic number and determine the MIME type. The library recognizes two PDF byte sequences, which it matches from the start of the file:

  • standard PDFs: %PDF-
  • PDFs preceded with the UTF-8 Byte Order Mark (BOM): \xef\xbb\xbf%PDF-

If the PDF does not start with either of those sequences, it is rejected.

I have been given the following file (see image), which opens without error in Acrobat and other viewers. I do not know which Byte Order Mark (BOM), if any, the bytes preceding the %PDF- represent.

255044462D is %PDF-

Here is the HEX sequence with the unidentified BOM:

ACED0005757200025B42ACF317F8060854E0020000787000007CD4255044462D

Is this a valid BOM, and if so, how do I identify it?


UPDATE

Per the answer below, the solution is to search the first 1024 bytes for the %PDF- sequence. I have solved this in the Medsea mime-util library by altering the magic.mime file, using an undocumented feature that is described only in a comment in the source code.

Alter this entry:

0    string    %PDF-    application/pdf    ignore    pdf

as follows:

0    string>1024    %PDF-    application/pdf    ignore    pdf

This undocumented feature is explained in a comment embedded in the source code of eu.medsea.mimeutil.detector.MagicMimeEntry.java method readBuffer(byte[]) for MagicMimeEntry.STRING_TYPE:

// The following is not documented in the Magic(5) documentation.
// This is an extension to the magic rules and is provided by this utility.
// It allows for better matching of some text based files such as XML files

The code following that comment parses a ># suffix (where # is a number) from the column 2 "type" value and uses # as the size of the buffer to search, starting at the index given in column 1.
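To make the effect of the altered entry concrete, here is a minimal sketch of how a rule such as `string>1024` could be interpreted. It is not the mime-util source: the class `RangedStringRule` and its methods are hypothetical names, and the semantics (search a window of # bytes starting at the column 1 offset) are assumed from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; this is NOT the mime-util implementation.
final class RangedStringRule {

    /** Returns the number after '>' in the type column, or -1 if there is no '>' suffix. */
    static int searchLength(String typeColumn) {
        int gt = typeColumn.indexOf('>');
        if (gt < 0) {
            return -1;
        }
        return Integer.parseInt(typeColumn.substring(gt + 1).trim());
    }

    /**
     * Plain "string": the expected text must sit exactly at the offset.
     * "string>N": the expected text may sit anywhere in the N bytes starting at the offset.
     */
    static boolean matches(byte[] data, int offset, String typeColumn, String expected) {
        byte[] needle = expected.getBytes(StandardCharsets.ISO_8859_1);
        int window = searchLength(typeColumn);
        if (window < 0) {
            window = needle.length; // no suffix: compare at the offset only
        }
        if (offset < 0 || offset > data.length) {
            return false;
        }
        int end = (int) Math.min((long) data.length, (long) offset + window);
        for (int i = offset; i + needle.length <= end; i++) {
            if (Arrays.equals(Arrays.copyOfRange(data, i, i + needle.length), needle)) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, a file with 27 bytes in front of %PDF- fails the original `string` rule at offset 0 but passes the `string>1024` rule.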

JoshDM

1 Answer

Read this answer on a related topic:

According to the PDF standard (ISO 32000-2, similarly also already in ISO 32000-1):

The PDF file begins with the 5 characters “%PDF–”

(ISO 32000-2, section 7.5.2 "File header")

In particular, there is no such thing as "UTF-8 encoded PDFs (preceded with the UTF-8 Byte Order Mark)"; even that BOM is invalid.

Nonetheless, Adobe Reader and other PDF viewers open files with a few arbitrary leading trash bytes as PDFs without complaint. This is because Adobe Reader is explicitly lax about the specification:

Acrobat viewers require only that the header appear somewhere within the first 1024 bytes of the file.

(Adobe PDF Reference sixth edition, appendix H.3 "Implementation Notes", item 13)

and other PDF viewers follow its lead.

Thus, if you want to use magic numbers to identify file type validity as in "valid according to the specification", you must only accept files beginning with the 5 characters “%PDF-”. On the other hand, if you want to judge validity by "opens in common viewers", you have to accept anything with “%PDF-” appearing somewhere within the first 1024 bytes of the file.
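The two notions of validity can be sketched as follows. This is only an illustration, not code from any PDF library: the class `PdfHeaderScan` and its method names are made up, and it assumes that the whole 5-byte marker must lie within the first 1024 bytes for the lenient check to pass.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative and not part of any library.
final class PdfHeaderScan {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final int SCAN_LIMIT = 1024; // Acrobat's documented tolerance

    /** Offset of "%PDF-" within the first 1024 bytes, or -1 if it is not there. */
    static int findPdfHeader(byte[] data) {
        int limit = Math.min(data.length, SCAN_LIMIT);
        for (int i = 0; i + PDF_MAGIC.length <= limit; i++) {
            if (matchesAt(data, i)) {
                return i;
            }
        }
        return -1;
    }

    /** "Valid according to the specification": the file begins with "%PDF-". */
    static boolean isStrictPdf(byte[] data) {
        return matchesAt(data, 0);
    }

    /** "Opens in common viewers": "%PDF-" appears within the first 1024 bytes. */
    static boolean isLenientPdf(byte[] data) {
        return findPdfHeader(data) >= 0;
    }

    private static boolean matchesAt(byte[] data, int offset) {
        if (offset < 0 || offset + PDF_MAGIC.length > data.length) {
            return false;
        }
        for (int j = 0; j < PDF_MAGIC.length; j++) {
            if (data[offset + j] != PDF_MAGIC[j]) {
                return false;
            }
        }
        return true;
    }

    /** Decodes an even-length hex string into bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // The 32 bytes quoted in the question.
        byte[] sample = fromHex("ACED0005757200025B42ACF317F8060854E0020000787000007CD4255044462D");
        // prints "strict=false lenient=true offset=27"
        System.out.println("strict=" + isStrictPdf(sample)
                + " lenient=" + isLenientPdf(sample)
                + " offset=" + findPdfHeader(sample));
    }
}
```

For the byte sequence from the question, the strict check fails and the lenient check finds the marker at offset 27.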

Even worse,

Acrobat viewers also accept a header of the form

%!PS-Adobe-N.n PDF-M.m

(Adobe PDF Reference sixth edition, appendix H.3 "Implementation Notes", item 14)

So in this case you also have to accept this sequence in the first 1024 bytes...
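A check that accepts either header form might look like the sketch below. The class name `LenientPdfHeader` is hypothetical, and the regular expression assumes that N.n and M.m are plain decimal version numbers separated from each other by a single space, which the implementation note does not spell out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern for N.n and M.m is an assumption.
final class LenientPdfHeader {
    private static final int SCAN_LIMIT = 1024;

    // Either the regular header or the PostScript-style form Acrobat also accepts.
    private static final Pattern HEADER =
            Pattern.compile("%PDF-|%!PS-Adobe-\\d+\\.\\d+ PDF-\\d+\\.\\d+");

    /** True if one of the accepted headers lies entirely within the first 1024 bytes. */
    static boolean looksLikePdf(byte[] data) {
        int limit = Math.min(data.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so binary junk cannot break decoding.
        String head = new String(data, 0, limit, StandardCharsets.ISO_8859_1);
        return HEADER.matcher(head).find();
    }
}
```

Note that a plain PostScript file starting with `%!PS-Adobe-3.0` and no `PDF-M.m` part is not matched by this pattern.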


I didn't close your question as a duplicate of the referenced answer because you appear to believe that there is something like "UTF-8 encoded PDFs", that some BOMs may be valid in front of the “%PDF-”. No, nothing is allowed in front of those header bytes, neither a UTF BOM nor anything else.

mkl
  • "you appear to believe that there is something like UTF-8 encoded PDFs"; that's quite the disingenuous statement. I have encountered "in the wild" PDFs which have that exact sequence of characters before it, and it is part of the Java-based Medsea MIME parsing library, however I am changing my code to account for both the 1024-byte scan and the H.3 impl notes. – JoshDM Jul 02 '20 at 02:22
  • Yeah. The file is PDF and defines its own format: it seems text but it is not text, it is binary data (based on ASCII, but it is really binary data). Things are complex for the embedded fonts: some are Latin1 (few legacy fonts) and the rest are Unicode based (and no one can change the interpretation of such fonts). – Giacomo Catenazzi Jul 02 '20 at 10:23
  • Welcome to [PDF 2.0](https://github.com/pdf-association/pdf20examples/blob/master/PDF%202.0%20with%20offset%20start.pdf)... in which such "trash bytes" are allowed :-/ (see ISO/DIS 32000-2 - 7.5.2 File header) – Jan Slabon Jul 03 '20 at 08:03
  • @JanSlabon They are not *allowed* (*"The PDF file begins with the 5 characters “%PDF–”"*), NOTE 1 *does not **allow*** the trash bytes, it merely gives a hint on how to deal with PDFs which (for whatever reason) happen to have such preceding trash bytes if one does not simply want to reject them because of this error. – mkl Jul 03 '20 at 08:32
  • Ok, reading the texts with the opposite in mind, you are correct. Still strange that Matt demos exactly such case. IIRC this was also promoted as a "feature" at PDF Days 2017/2018. – Jan Slabon Jul 03 '20 at 13:16
  • The pdf specification unfortunately is a very bad specification, it has inherited much of the non-normative character of the old pdf references. Nonetheless, the specification is what's out in public. If the authors meant something different than they wrote, they'll have to accept what they wrote until they change it in some corrigenda. This is in particular true for any section changed in ISO 32000-2 because here they knew they work on an ISO specification and have to formulate as is common in specifications. – mkl Jul 04 '20 at 19:58
  • *"IIRC this was also promoted as a "feature" at PDF Days"* - anyone promoting that as a feature hardly can be taken seriously, can he? – mkl Jul 04 '20 at 20:15
  • @JanSlabon *"Still strange that Matt demos exactly such case."* - If you mean the `PDF 2.0 with offset start.pdf` example, please read the comment preceding the `%PDF` in that very PDF: In particular it correctly warns that PDF processors may reject this file as incorrect. Furthermore, it recommends not to add such data unless really needed, e.g. *in a print workflow when a print processor needs to write printer control data to select a PDF processing mode.* And even in that case I wouldn't call that whole file a valid PDF file but a print data file containing printer control data plus a PDF. – mkl Jul 07 '20 at 10:06
  • @mkl You got me in your first comment and I'm with you. I simply had these things in mind, and backed by this example it looked even more like a "feature". It isn't. They simply described a way to handle such a situation. – Jan Slabon Jul 07 '20 at 14:34
  • ;) ok. I was just curious which example that originally might have been, so I searched and looked into it. – mkl Jul 07 '20 at 15:02