I am processing files and using magic numbers to validate their file types.
I am using the Medsea mime-util JAR for Java to inspect the magic number and determine the MIME type. This library checks for two different PDF sequences, reading from left to right:
- standard PDFs:
%PDF-
- PDFs preceded with the UTF-8 Byte Order Mark (BOM):
\xef\xbb\xbf%PDF-
If the PDF does not start with either of those sequences, it is rejected.
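To make the rule above concrete, here is a minimal sketch of the same two-prefix check. It is not the library's actual code; the class and method names (PdfPrefixCheck, startsWith, looksLikePdf) are hypothetical and only illustrate the behaviour described.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the check described above; not mime-util code.
final class PdfPrefixCheck {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // True if 'prefix' occurs in 'data' exactly at 'offset'.
    static boolean startsWith(byte[] data, int offset, byte[] prefix) {
        if (offset < 0 || data.length - offset < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[offset + i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Accepts "%PDF-" at offset 0, or the UTF-8 BOM followed directly by "%PDF-".
    static boolean looksLikePdf(byte[] head) {
        if (startsWith(head, 0, PDF)) {
            return true;
        }
        return startsWith(head, 0, UTF8_BOM) && startsWith(head, UTF8_BOM.length, PDF);
    }
}
```

Any file whose first bytes are something else, such as the one described next, fails this check even if %PDF- appears a little later.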
I have been given the following file (see image), which opens correctly in Acrobat and other viewers, but I do not recognize the bytes preceding %PDF- as any Byte Order Mark (BOM) I know.
For reference, 255044462D is the hex encoding of %PDF-.
Here is the HEX sequence with the unidentified BOM:
ACED0005757200025B42ACF317F8060854E0020000787000007CD4255044462D
Is this a valid BOM, and if so, how do I identify it?
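For anyone who wants to reproduce the observation, the following sketch decodes the hex dump above and reports where %PDF- starts. The class and helper names (HexOffsetDemo, fromHex, indexOf) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: decode the hex dump and locate the "%PDF-" marker.
final class HexOffsetDemo {
    // Converts a string of hex digit pairs into the bytes they encode.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // Returns the index of the first occurrence of 'needle' in 'data', or -1.
    static int indexOf(byte[] data, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= data.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] head = fromHex(
            "ACED0005757200025B42ACF317F8060854E0020000787000007CD4255044462D");
        byte[] marker = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOf(head, marker)); // prints 27
    }
}
```

So 27 bytes precede %PDF- in this file, compared with 3 bytes for the UTF-8 BOM.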
UPDATE
Per the answer below, the solution is to check the first 1024 characters for the %PDF- sequence. I have solved this in the Medsea mime-util library by altering the magic.mime file, using an undocumented feature that is described in a comment in the source code.
Alter this entry:
0 string %PDF- application/pdf ignore pdf
as follows:
0 string>1024 %PDF- application/pdf ignore pdf
This undocumented feature is explained in a comment embedded in the source code of eu.medsea.mimeutil.detector.MagicMimeEntry.java, in the method readBuffer(byte[]), in the handling of MagicMimeEntry.STRING_TYPE:
// The following is not documented in the Magic(5) documentation.
// This is an extension to the magic rules and is provided by this utility.
// It allows for better matching of some text based files such as XML files
The code that follows this comment parses a ># suffix from the column 2 "type" value and uses # as the size of the buffer to search, starting at the index given by the column 1 value.
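As a rough illustration of that behaviour, here is a sketch of a windowed string match. It is not the mime-util implementation: the names (WindowedStringMatch, windowSize, matches) are hypothetical, and it assumes that the whole match must fit inside the window, which the library may handle differently.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a "string>N" rule; not the library's actual code.
final class WindowedStringMatch {
    // Parses the number after '>' in a type such as "string>1024".
    // Returns -1 when the type has no '>' suffix (plain "string").
    static int windowSize(String typeColumn) {
        int gt = typeColumn.indexOf('>');
        return gt < 0 ? -1 : Integer.parseInt(typeColumn.substring(gt + 1).trim());
    }

    // Without a window, 'content' must sit exactly at 'offset'.
    // With a window of N, 'content' may appear anywhere in the N bytes
    // starting at 'offset' (assumption: the match must fit in the window).
    static boolean matches(byte[] data, int offset, String typeColumn, String content) {
        byte[] needle = content.getBytes(StandardCharsets.ISO_8859_1);
        int window = windowSize(typeColumn);
        if (window < 0) {
            window = needle.length;
        }
        int end = Math.min(data.length, offset + window);
        for (int i = offset; i + needle.length <= end; i++) {
            boolean found = true;
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) {
                    found = false;
                    break;
                }
            }
            if (found) {
                return true;
            }
        }
        return false;
    }
}
```

Under these assumptions, a file with 27 bytes before %PDF- is rejected by the plain "string" rule at offset 0 but accepted by "string>1024".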