This requires modifications to the example PDFA_def.ps script. My answer is based on Ghostscript 9.27 as packaged with Debian 10, where the file is found at /usr/share/ghostscript/9.27/lib/PDFA_def.ps (it is also in the Ghostscript repository). A newer version should work similarly. I will assume you have already edited the file to point to the correct path for the color profile (I will use ArgyllCMS's sRGB profile).
We can fairly easily get any file embedded by following existing pdfmark embedding tutorials, such as this email or this related question (as you found). Note that the namespace push/pop isn't important, so I appended just this PostScript code to the end of my PDFA_def.ps file:
[/_objdef {fstream} /type /stream /OBJ pdfmark % create an object as a stream
[{fstream} << /Type /EmbeddedFile >> /PUT pdfmark % sets the stream's type
% use one of these two options to define the contents of the stream
%[{fstream} (Alternatively, inline text) /PUT pdfmark
[{fstream} MyFileToEmbed (r) file /PUT pdfmark
% define the embedded file via the /EMBED pdfmark option
[/Name (my.txt)
/FS <<
/Type /Filespec
/F (my.txt)
/EF <<
/F {fstream}
>>
>> /EMBED pdfmark
[{fstream} /CLOSE pdfmark % close streams when done
Note that I used a file read from MyFileToEmbed, which must be defined on the gs command line as -sMyFileToEmbed=/path/to/my/file.txt. If you just want plain text, uncomment the first option (and remove the second line, which references MyFileToEmbed). Spacing isn't generally important either.
This is presumably where you are: this embeds the file, but the result isn't a valid PDF/A-3, as you say. Let's look at each of the errors in turn:
The file specification dictionary for an embedded file shall contain the F and UF keys
This is the easiest to take care of: simply add a /UF key to the /FS dictionary.
First, however, a brief primer on PostScript syntax (at least as used in this answer), as it's rather unusual, and even Stack Overflow lacks syntax highlighting for it (I used the LaTeX highlighter in this post).
PostScript (PS) is a stack-based language that uses postfix notation: operands are pushed onto the stack first, and the operator that consumes them comes last. What C-style languages would express as file(MyFileToEmbed, "r") is MyFileToEmbed (r) file. Comments start with %, and strings are not quoted but enclosed in parentheses. /foo is a name, roughly equivalent to :foo in Ruby or 'foo in a Lisp. << ... >> is a dict, like { ... } in modern scripting languages. Here, {foo} is instead a pdfmark named object [1], and the pdfmark lines we will use start with a mark: [ (see the PostScript or pdfmark spec for more details).
With that knowledge in mind, we can update the code for /UF (spec page 102):
/F (my.txt)
/UF (my.txt) % Add this. Unicode file name, defined in the spec at Table 44, section 7.11.3
/EF <<
Running veraPDF shows we are now down to 3 errors, though we are duplicating (my.txt). Let's make a variable for it, either by adding a gs argument -sMyFileName=my.txt or by defining it anywhere before its first use:
/MyFileName (my.txt) def
% snip...
/F MyFileName % update to the variable
/UF MyFileName
% snip...
Now let's tackle the next error:
In order to enable identification of the relationship between the file specification dictionary and the content that is referring to it, a new (required) key has been defined and its presence (in the dictionary) is required.
This one is also straightforward: the /FS dictionary needs an /AFRelationship key (I can't find the spec, but here's the notes, page 6). That value can be:
Source, Data, Alternative, Supplement, EncryptedPayload, FormData, Schema or Unspecified. Custom values may be used where none of these entries is appropriate.
I will use Supplement here, but pick whichever is most appropriate for whatever you are embedding:
/F MyFileName
/UF MyFileName
/AFRelationship /Supplement % These lines can be in any order, but the key must be before the value
/EF <<
Checking with veraPDF, we are down to 2 errors. Nice! Next error to tackle:
The MIME type of an embedded file, or a subset of a file, shall be specified using the Subtype key of the file specification dictionary. If the MIME type is not known, the "application/octet-stream" shall be used.
This is slightly trickier. MIME types have a / in them, and since spaces don't really matter in PS, /Type/Filespec is just as valid as /Type /Filespec. Thus, as the MIME type must be a name, not a string, we can't simply say /text/plain: that would be read as two separate names, /text and /plain. Instead we'll need to use the cvn operator (spec, page 402), which converts strings to names (like intern in Lisps, or to_sym in Ruby). Note that this parameter goes on the stream object's dictionary itself (spec, page 104, table 45):
[{fstream} << /Type /EmbeddedFile
/Subtype (text/plain) cvn % Our new addition (can be on the same line as above)
>> /PUT pdfmark
% equivalent to the Ruby-syntax: { :Type => :EmbeddedFile, :Subtype => cvn("text/plain") }
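If you don't know the MIME type in advance, one option is to look it up on the shell side and hand it to Ghostscript as a string variable. The following is only a sketch, assuming the common file(1) utility is available; detect_mime is a made-up helper name, and MyMimeType is the variable used in the full listing further down. When the type can't be determined it falls back to application/octet-stream, as the rule quoted above requires.

```shell
#!/bin/sh
# detect_mime FILE: print a MIME type for FILE.
# Hypothetical helper (not part of Ghostscript). Assumes file(1) supports
# -b (brief output) and --mime-type; otherwise the fallback is printed.
detect_mime() {
    mime=""
    if command -v file >/dev/null 2>&1 && [ -f "$1" ]; then
        mime=$(file -b --mime-type "$1" 2>/dev/null)
    fi
    case "$mime" in
        ""|*" "*) printf '%s\n' "application/octet-stream" ;; # empty or an error message
        */*)      printf '%s\n' "$mime" ;;                    # looks like type/subtype
        *)        printf '%s\n' "application/octet-stream" ;;
    esac
}

# Possible use on the gs command line:
#   -sMyMimeType="$(detect_mime /tmp/test.png)"
```

The string still has to go through cvn on the PostScript side, since /Subtype must be a name.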
Down to our final A-3 validation error, and this one is a bit trickier:
The additional information provided for associated files as well as the usage requirements for associated files indicate the relationship between the embedded file and the PDF document or the part of the PDF document with which it is associated.
We need to split the EMBED from the definition, and update the document /Catalog dictionary to also point to the definition with the /AF key (notes, page 6). Create a new pdfmark dict object, and refactor the code:
[/_objdef {myfileinfo} /type /dict /OBJ pdfmark % create new object
% assign the new pdfmark object
[{myfileinfo} <<
/Type /Filespec
/F MyFileName
/UF MyFileName
/AFRelationship /Supplement
/EF <<
/F {fstream}
>>
>> /PUT pdfmark % refactored out of the following line
[/Name MyFileName /FS {myfileinfo} /EMBED pdfmark % updated embed line
% This line was moved from the end of the original PDFA_def.ps to after our attachment code
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ] /AF [{myfileinfo}] >> /PUT pdfmark
Now running this through Ghostscript we get a veraPDF-accepted valid PDF/A-3B with an arbitrary document attachment!
For completeness, here is the whole modified PDFA_def.ps file, and the command I used to run it. Note I've replaced most constants with variables. For multiple attachments you can add more copies of the code we added, with more {fstream} and {myfileinfo} objects (with different names, obviously), and presumably list each new file specification object in the /AF array as well.
The final full listing of our modified PDFA_def_attach.ps:
%!
% This is a modified version of the Ghostscript 9.27 PDFA_def.ps file
% that creates a PDF/A-3 file with an embedded attachment
% Define entries in the document Info dictionary :
/ICCProfile (/usr/share/color/argyll/ref/sRGB.icm) % Customize
def
[ /Title (My PDF/A-3 with an embedded attachment) /DOCINFO pdfmark % Customize
% Define an ICC profile :
[/_objdef {icc_PDFA} /type /stream /OBJ pdfmark
[{icc_PDFA}
<<
/N currentpagedevice /ProcessColorModel known {
currentpagedevice /ProcessColorModel get dup /DeviceGray eq
{pop 1} {
/DeviceRGB eq
{3}{4} ifelse
} ifelse
} {
(ERROR, unable to determine ProcessColorModel) == flush
} ifelse
>> /PUT pdfmark
[{icc_PDFA} ICCProfile (r) file /PUT pdfmark
% Define the output intent dictionary :
[/_objdef {OutputIntent_PDFA} /type /dict /OBJ pdfmark
[{OutputIntent_PDFA} <<
/Type /OutputIntent % Must be so (the standard requires).
/S /GTS_PDFA1 % Must be so (the standard requires).
/DestOutputProfile {icc_PDFA} % Must be so (see above).
/OutputConditionIdentifier (sRGB) % Customize
>> /PUT pdfmark
% New code starts here.
% If you would rather not use Ghostscript command line arguments,
% uncomment these variable definitions
%/MyFileName (my.txt) def % alternative to -sMyFileName=my.txt
%/MyFileToEmbed (/path/to/my/file.txt) def % alternative to -sMyFileToEmbed=/path/to/my/file.txt
%/MyMimeType (text/plain) def % alternative to -sMyMimeType=text/plain
% Define the embedded file objects
[/_objdef {myfileinfo} /type /dict /OBJ pdfmark
[/_objdef {fstream} /type /stream /OBJ pdfmark
% Load the file to embed
[{fstream} MyFileToEmbed (r) file /PUT pdfmark
% assign the stream information
[{fstream} <<
/Type /EmbeddedFile
/Subtype MyMimeType cvn
% /Params << % Optional, see Table 46, page 104 for options
% /Size 1234 % or use a -dMyVarName flag. -d defines numbers, -s Strings
% /ModDate (D:20211216) % see section 7.9.4, page 87 for full date format
% % etc...
% >>
>> /PUT pdfmark
% assign the file information
[{myfileinfo} <<
/Type /Filespec
% /Desc (My Optional Description) % optional, see page 103, table 44
/F MyFileName
/UF MyFileName
/AFRelationship /Supplement
/EF <<
/F {fstream}
>>
>> /PUT pdfmark
% Embed the stream
[/Name MyFileName /FS {myfileinfo} /EMBED pdfmark
[{fstream} /CLOSE pdfmark
% Updated last line from the original PDFA_def.ps
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ] /AF [{myfileinfo}] >> /PUT pdfmark
The command line (using GS 9.27):
gs -dPDFA=3 -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dPDFSETTINGS=/default -dAutoRotatePages=/All -sMyFileToEmbed=/tmp/test.png -sMyMimeType=image/png -sMyFileName=the_name_you_see_in_reader.png -dNOPAUSE -dBATCH -dQUIET -o output-a3.pdf PDFA_def_attach.ps input.pdf
(Note that in more recent Ghostscript versions, 9.50 and later, where -dSAFER is the default, you'll need to add --permit-file-read=/tmp/test.png)
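If you run this often, the command can be wrapped in a small shell function so the attachment path only has to be given once. This is a rough sketch rather than a tested production script: make_pdfa3 is a made-up name, GS and PDFA_DEF are environment overrides I invented for convenience (they are not Ghostscript features), and it assumes a Ghostscript new enough to understand --permit-file-read (drop that line on 9.27).

```shell
#!/bin/sh
# make_pdfa3 INPUT.pdf ATTACHMENT OUTPUT.pdf [MIME-TYPE]
# Hypothetical wrapper around the gs command above.
#   GS        program to run (assumed override, defaults to gs)
#   PDFA_DEF  path to the modified definition file (assumed override)
make_pdfa3() {
    input=$1
    attach=$2
    output=$3
    mime=${4:-application/octet-stream}   # fallback when no type is given

    # The ICC profile named inside the .ps file may need its own
    # --permit-file-read on newer versions; that is not handled here.
    "${GS:-gs}" -dPDFA=3 -sColorConversionStrategy=RGB -sDEVICE=pdfwrite \
        -dPDFACompatibilityPolicy=1 -dPDFSETTINGS=/default \
        -dAutoRotatePages=/All \
        --permit-file-read="$attach" \
        -sMyFileToEmbed="$attach" \
        -sMyMimeType="$mime" \
        -sMyFileName="${attach##*/}" \
        -dNOPAUSE -dBATCH -dQUIET \
        -o "$output" "${PDFA_DEF:-PDFA_def_attach.ps}" "$input"
}

# Example call:
#   make_pdfa3 input.pdf /tmp/test.png output-a3.pdf image/png
```

Here the name shown in the reader is simply the attachment's base name; pass a different -sMyFileName value if you want another one.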
1 Technically, as KenS helpfully pointed out in the comments, it's actually a PostScript procedure used in a non-standard way, but as this answer won't be using procedures outside of pdfmark named objects (excluding the already-written parts of PDFA_def.ps), I've glossed over that implementation detail here in favor of the name used in the pdfmark reference manual.