
I can manage to create a PDF/A-3 using Ghostscript's PDFA_def.ps file from a normal PDF, but similar to this answer for per-page embedded files, any non-PDF embedded attachments are stripped. I found a way to embed files in this SO post, but it generates a PDF that fails PDF/A-3 validation. veraPDF reports the following 4 errors:

The additional information provided for associated files as well as the usage requirements for associated files indicate the relationship between the embedded file and the PDF document or the part of the PDF document with which it is associated.
CosFileSpecification
isAssociatedFile == true

In order to enable identification of the relationship between the file specification dictionary and the content that is referring to it, a new (required) key has been defined and its presence (in the dictionary) is required.
CosFileSpecification
AFRelationship != null

The MIME type of an embedded file, or a subset of a file, shall be specified using the Subtype key of the file specification dictionary. If the MIME type is not known, the "application/octet-stream" shall be used.
EmbeddedFile
Subtype != null && /^[-\w+.]+\/[-\w+.]+$/.test(Subtype)

The file specification dictionary for an embedded file does not contain either F or EF key.
CosFileSpecification
EF_size == 0 || (F != null && UF != null)

How can I either avoid stripping or just re-attach embedded whole-document files to make a valid PDF/A-3B file via Ghostscript?

byteit101

1 Answer


This requires modifications to the example PDFA_def.ps script. My answer is based on Ghostscript 9.27, the version packaged with Debian 10, where the file is found at /usr/share/ghostscript/9.27/lib/PDFA_def.ps (it is also in the Ghostscript repository). A newer version should work similarly. I will assume you have already edited the file to point to the correct path for the color profile (I will use ArgyllCMS's sRGB profile).

We can fairly easily get any file embedded by following existing pdfmark embedding tutorials, such as this email or this related question (as you found). The namespace push/pop in those examples isn't important, so I appended just this PostScript code to the end of my PDFA_def.ps file:

[/_objdef {fstream} /type /stream          /OBJ   pdfmark % create an object as a stream
[{fstream} << /Type /EmbeddedFile >>       /PUT   pdfmark % sets the stream's type

% use one of these two options to define the contents of the stream
%[{fstream} (Alternatively, inline text)   /PUT   pdfmark
[{fstream} MyFileToEmbed (r) file          /PUT   pdfmark

% define the embedded file via the /EMBED pdfmark option
[/Name (my.txt)
   /FS <<
     /Type /Filespec
     /F (my.txt)
     /EF <<
       /F {fstream}
     >>
   >>                                      /EMBED pdfmark
[{fstream}                                 /CLOSE pdfmark % close streams when done

Note that I used a file read from MyFileToEmbed, which must be defined on the gs command line as -sMyFileToEmbed=/path/to/my/file.txt. If you just want plain text, uncomment the first option (and remove the second line that references MyFileToEmbed). Spacing isn't generally important either.

This is presumably where you are: the file is embedded, but, as you say, the result isn't a valid PDF/A-3. Let's look at each of the errors in turn:

The file specification dictionary for an embedded file shall contain the F and UF keys

This is the easiest to take care of: simply add a /UF key to the /FS dictionary.

First, however, a brief primer on PostScript syntax (at least as used in this answer), as it's rather unusual, and even Stack Overflow lacks syntax highlighting for it (I used the LaTeX highlighter in this post). PostScript (PS) is a stack-based language that uses postfix notation: operands are pushed onto the stack first, and the operator that consumes them comes last. What C-style languages would express as file(MyFileToEmbed, "r") is MyFileToEmbed (r) file. Comments start with %, and strings are delimited by parentheses rather than quotes. /foo is a name, roughly equivalent to :foo in Ruby or 'foo in a Lisp. << ... >> is a dict, like { ... } in modern scripting languages. Here, {foo} is instead a pdfmark named object [1], and the pdfmark lines we will use start with a mark: [ (see the PostScript or pdfmark spec for more details).
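
As a rough illustration of the syntax described above, here is a minimal sketch that can be pasted into an interactive gs session. None of it is part of PDFA_def.ps, and the name Greeting is made up for the example.

```postscript
% Operands are pushed first; the operator that consumes them comes last.
1 2 add ==                      % pushes 1 and 2, adds them, prints 3

% Strings use parentheses, names start with a slash.
/Greeting (hello) def           % bind the name Greeting to a string
Greeting ==                     % prints (hello)

% << ... >> builds a dictionary of key/value pairs.
<< /Type /Filespec /F (my.txt) >> /F get ==   % prints (my.txt)

% [ is just a mark on the stack; counttomark counts what sits above it.
[ 10 20 counttomark ==          % prints 2
cleartomark                     % remove the mark and everything above it
```

The == operator pops the top of the stack and prints it in PostScript syntax, which makes it handy for checking what a fragment leaves behind.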

With that knowledge in mind, we can update the code for /UF (spec page 102):

     /F (my.txt)
     /UF (my.txt) % Add this. Unicode file name, defined in the spec, Table 44, section 7.11.3
     /EF <<

Running veraPDF shows we are now down to 3 errors, though we are now duplicating (my.txt). Let's make a variable for it, either by adding the gs argument -sMyFileName=my.txt or by defining it anywhere before its first use:

/MyFileName (my.txt) def
% snip...
     /F MyFileName % update to the variable
     /UF MyFileName
% snip...
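
If you want both options at once, one possible approach (a sketch, not something the stock PDFA_def.ps does) is to define the variable only when it was not already supplied on the command line. The fallback value (my.txt) is just an assumed default.

```postscript
% Use the value from -sMyFileName=... if it was given,
% otherwise fall back to a default defined here.
/MyFileName where
  { pop }                        % already defined: discard the dict that where returned
  { /MyFileName (my.txt) def }   % not defined: bind the fallback value
ifelse
MyFileName ==                    % prints the string that will be used
```

The where operator searches the dictionary stack for the key, so it also finds names that Ghostscript defined from -s and -d flags.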

Now let's tackle the next error:

In order to enable identification of the relationship between the file specification dictionary and the content that is referring to it, a new (required) key has been defined and its presence (in the dictionary) is required.

This one is also straightforward: the /FS dictionary needs an /AFRelationship key (I can't find the spec, but here are the notes, page 6). Its value can be:

Source, Data, Alternative, Supplement, EncryptedPayload, FormData, Schema or Unspecified. Custom values may be used where none of these entries is appropriate.

I will use Supplement here, but pick whichever is most appropriate for whatever you are embedding:

     /F MyFileName
     /UF MyFileName
     /AFRelationship /Supplement % These lines can be in any order, but the key must be before the value
     /EF <<

Checking with veraPDF, we are down to 2 errors. Nice! Next error to tackle:

The MIME type of an embedded file, or a subset of a file, shall be specified using the Subtype key of the file specification dictionary. If the MIME type is not known, the "application/octet-stream" shall be used.

This is slightly trickier. MIME types have a / in them, and since whitespace between tokens is optional in PS, /Type/Filespec is just as valid as /Type /Filespec. The MIME type must be a name, not a string, but we can't simply write /text/plain, because that is read as two names, /text and /plain. Instead we'll need the cvn operator (spec, page 402), which converts a string to a name (like intern in Lisps, or to_sym in Ruby). Note that this parameter goes in the stream object's dictionary itself (spec, page 104, table 45):

[{fstream} << /Type /EmbeddedFile
    /Subtype (text/plain) cvn % Our new addition (can be on the same line as above)
    >>       /PUT   pdfmark
    % equivalent to the Ruby-syntax: { :Type => :EmbeddedFile, :Subtype => cvn("text/plain") }
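
To see the difference for yourself, the following small sketch can be run in an interactive gs session. It is only an illustration of how the scanner treats the slash and is not needed in PDFA_def.ps.

```postscript
% Written literally, /text/plain scans as two separate name tokens.
mark /text/plain counttomark ==   % prints 2
cleartomark                       % drop the mark and both names

% cvn builds a single name whose text contains the slash.
(text/plain) cvn dup type ==      % prints nametype
==                                % prints /text/plain
```

The last line looks the same as the literal form when printed, but it is one name object rather than two, which is what the /Subtype key needs.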

Down to our final A-3 validation error, and this one is a bit trickier:

The additional information provided for associated files as well as the usage requirements for associated files indicate the relationship between the embedded file and the PDF document or the part of the PDF document with which it is associated.

We need to split the EMBED from the definition, and update the document /Catalog dictionary to also point to the definition with the /AF key (notes, page 6). Create a new pdfmark dict object, and refactor the code:

[/_objdef {myfileinfo} /type /dict /OBJ pdfmark % create new object
% assign the new pdfmark object
[{myfileinfo} <<
     /Type /Filespec
     /F MyFileName
     /UF MyFileName
     /AFRelationship /Supplement
     /EF <<
       /F {fstream}
     >>
  >> /PUT pdfmark % refactored out of the following line
[/Name MyFileName /FS {myfileinfo} /EMBED pdfmark % updated embed line

% This line was moved from the end of the original PDFA_def.ps to after our attachment code
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ] /AF [{myfileinfo}] >> /PUT pdfmark

Running this through Ghostscript now produces a valid PDF/A-3B, accepted by veraPDF, with an arbitrary document attachment! For completeness, here is the whole modified PDFA_def.ps file and the command I used to run it. Note that I've replaced most constants with variables. For multiple attachments, you can add more copies of the code we added, with more {fstream} and {myfileinfo} objects (with different names, obviously).
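
For a second attachment, the extra copy might look like the untested sketch below. The object names {fstream2} and {myfileinfo2} and the variables MyFileToEmbed2, MyFileName2, and MyMimeType2 are hypothetical; they are assumed to be defined the same way as the first set, via -s flags or def.

```postscript
% Hypothetical second attachment: same pattern, different object names.
[/_objdef {myfileinfo2} /type /dict   /OBJ pdfmark
[/_objdef {fstream2}    /type /stream /OBJ pdfmark

% Load the second file and describe its stream
[{fstream2} MyFileToEmbed2 (r) file   /PUT pdfmark
[{fstream2} << /Type /EmbeddedFile /Subtype MyMimeType2 cvn >> /PUT pdfmark

% File specification for the second file
[{myfileinfo2} <<
    /Type /Filespec
    /F MyFileName2
    /UF MyFileName2
    /AFRelationship /Supplement
    /EF << /F {fstream2} >>
  >> /PUT pdfmark

[/Name MyFileName2 /FS {myfileinfo2} /EMBED pdfmark
[{fstream2} /CLOSE pdfmark

% The catalog's /AF array then lists every file specification.
% This replaces the single-attachment {Catalog} line.
[{Catalog} << /OutputIntents [ {OutputIntent_PDFA} ] /AF [ {myfileinfo} {myfileinfo2} ] >> /PUT pdfmark
```

Each additional file would also need its own --permit-file-read entry on recent Ghostscript versions, and the result is worth re-checking with veraPDF.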

The final full listing of our modified PDFA_def_attach.ps:

%!
% This is a modified version of the Ghostscript 9.27 PDFA_def.ps file
% that creates a PDF/A-3 file with an embedded attachment

% Define entries in the document Info dictionary :
/ICCProfile (/usr/share/color/argyll/ref/sRGB.icm) % Customize
def

[ /Title (My PDF/A-3 with an embedded attachment) /DOCINFO pdfmark        % Customize

% Define an ICC profile :

[/_objdef {icc_PDFA} /type /stream /OBJ pdfmark
[{icc_PDFA}
<<
  /N currentpagedevice /ProcessColorModel known {
    currentpagedevice /ProcessColorModel get dup /DeviceGray eq
    {pop 1} {
      /DeviceRGB eq
      {3}{4} ifelse
    } ifelse
  } {
    (ERROR, unable to determine ProcessColorModel) == flush
  } ifelse
>> /PUT pdfmark
[{icc_PDFA} ICCProfile (r) file /PUT pdfmark

% Define the output intent dictionary :

[/_objdef {OutputIntent_PDFA} /type /dict /OBJ pdfmark
[{OutputIntent_PDFA} <<
  /Type /OutputIntent             % Must be so (the standard requires).
  /S /GTS_PDFA1                   % Must be so (the standard requires).
  /DestOutputProfile {icc_PDFA}            % Must be so (see above).
  /OutputConditionIdentifier (sRGB)      % Customize
>> /PUT pdfmark

% New code starts here.

% If you want to not use Ghostscript command line arguments, 
% then uncomment these variable definitions
%/MyFileName (my.txt) def % alternative to -sMyFileName=my.txt
%/MyFileToEmbed (/path/to/my/file.txt) def % alternative to -sMyFileToEmbed=/path/to/my/file.txt
%/MyMimeType (text/plain) def % alternative to -sMyMimeType=text/plain

% Define the embedded file objects
[/_objdef {myfileinfo} /type /dict /OBJ pdfmark
[/_objdef {fstream} /type /stream  /OBJ pdfmark

% Load the file to embed
[{fstream} MyFileToEmbed (r) file  /PUT pdfmark

% assign the stream information
[{fstream} <<
    /Type /EmbeddedFile
    /Subtype MyMimeType cvn
%    /Params << % Optional, see Table 46, page 104 for options
%      /Size 1234 % or use a -dMyVarName flag. -d defines numbers, -s Strings
%      /ModDate (D:20211216) % see section 7.9.4, page 87 for full date format
%      % etc... 
%    >>
  >> /PUT pdfmark

% assign the file information
[{myfileinfo} <<
    /Type /Filespec
%    /Desc (My Optional Description) % optional, see page 103, table 44
    /F MyFileName
    /UF MyFileName
    /AFRelationship /Supplement
    /EF <<
      /F {fstream}
    >>
  >> /PUT pdfmark
    
% Embed the stream
[/Name MyFileName /FS {myfileinfo} /EMBED pdfmark
[{fstream} /CLOSE pdfmark

% Updated last line from the original PDFA_def.ps
[{Catalog} <</OutputIntents [ {OutputIntent_PDFA} ] /AF [{myfileinfo}] >> /PUT pdfmark

The command line (using GS 9.27):

gs -dPDFA=3 -sColorConversionStrategy=RGB -sDEVICE=pdfwrite -dPDFACompatibilityPolicy=1 -dPDFSETTINGS=/default -dAutoRotatePages=/All -sMyFileToEmbed=/tmp/test.png -sMyMimeType=image/png -sMyFileName=the_name_you_see_in_reader.png -dNOPAUSE -dBATCH -dQUIET -o output-a3.pdf PDFA_def_attach.ps input.pdf

(Note: in more recent Ghostscript versions you'll need to add --permit-file-read=/tmp/test.png to the command line.)


[1] Technically, as KenS helpfully pointed out in the comments, it's actually a PostScript procedure used in a non-standard way, but as this answer won't be using procedures outside of pdfmark named objects (excluding the already-written parts of PDFA_def.ps), I've glossed over that implementation detail here in favor of the name used in the pdfmark reference manual.
byteit101
  • This won't work with recent versions of Ghostscript as the security model will not permit file operations on arbitrary files. You either need to add -dNOSAFER to the command line (I don't recommend this) or use --permit-file-read= to add specific files to the list of files which can be read. The construct '{foo}' is not 'a pdfmark object' as stated; it's a perfectly normal PostScript procedure. The '[' token is a mark object and yes, it is where pdfmark gets its name from, but it's just a mark on the stack. It isn't necessary for 'pdfmark lines' to start with a '['. – KenS Dec 16 '21 at 13:36
  • @KenS That's good to know. Do you want to edit in those corrections? – byteit101 Dec 16 '21 at 19:41
  • @KenS I've added some clarifications, but am still unsure about some of your points. The [pdfmark spec](https://opensource.adobe.com/dc-acrobat-sdk-docs/acrobatsdk/pdfs/acrobatsdk_pdfmark.pdf) on page 10&11 calls them "named objects", not procedures. Is that not accurate, either in general, or in the usage in this answer? For the start of line, I had simplified the pdfmark line (page 7) syntax start as I've mostly seen `[` and not `mark` to start pdfmark lines, and was only explaining it so it can be read in this answer. Or is there something else I'm missing about that? – byteit101 Dec 17 '21 at 05:49
  • The point is that you have put, in a paragraph described as a PostScript primer, a statement that '{...}' is a pdfmark-specific named object. It isn't, it's a PostScript procedure. The way it is then used by the pdfmark non-standard PostScript operator and a Distiller-like device to create a PDF named object is a somewhat different point, I would say. Again, my point about the use of '[' was in relation to its use in PostScript. Yes, it is used that way (start of line) in all the pdfmark examples, but it doesn't have to be, and in PS terms it is a mark. Your revised text there is fine. – KenS Dec 17 '21 at 08:11
  • @KenS That is true, though the goal of that section was only to be a "brief primer", merely sufficient for understanding the syntax of the new code written. As such, I deliberately called it that, as the implementation details weren't relevant to reading the code. I didn't think the general term would have helped comprehension of this post for people not already familiar with PS. I've added a footnote to explain this lie-to-children, for those who wish to learn more details of PS. Do you think that is reasonable? – byteit101 Dec 17 '21 at 23:55