Apache PDFBox and PDF/A-3

Question

Is it possible to use Apache PDFBox to process PDF/A-3 documents? (Especially for changing field values?)

The PDFBox 1.8 Cookbook says that it is possible to create PDF/A-1 documents with pdfaid.setPart(1);

Can I apply pdfaid.setPart(3) for a PDF/A-3 document?
If not: is it possible to read in a PDF/A-3 document, change some field values, and save it? I don't need to create or convert to PDF/A-3, but the document should still be PDF/A-3 afterwards.

Your question was already answered correctly (and very nicely) in the PDFBox user mailing list. — Tilman Hausherr, Aug 16 '16 at 11:25

score 5 · Accepted Answer · answered Aug 16 '16 at 12:13

PDFBox supports that, but be aware that PDFBox is a low-level library, so you have to ensure the conformance yourself; i.e. there is no 'Save as PDF/A-3'. You might want to take a look at http://www.mustangproject.org, which uses PDFBox to support ZUGFeRD (electronic invoicing), which also requires PDF/A-3.
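Since conformance is up to the caller, one cheap sanity check after changing and saving a document is to confirm that its XMP packet still declares PDF/A-3. The following is only a sketch using the JDK's XML parser: the class name `PdfaClaim` and its method are hypothetical (not part of PDFBox), and it assumes you have already obtained the XMP packet bytes from the document's metadata stream, for example through PDFBox's `PDMetadata`. It reads the `pdfaid:part` and `pdfaid:conformance` properties, which XMP may serialize either as child elements or as attributes of `rdf:Description`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Hypothetical helper (not a PDFBox class): reads the PDF/A claim from an XMP packet. */
final class PdfaClaim
{
    // Namespace of the PDF/A identification schema (usual prefix: "pdfaid").
    static final String PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns e.g. "3B", or null if the packet declares no pdfaid:part. */
    static String read(byte[] xmpPacket)
    {
        try
        {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // XMP packets never need a DTD, so refuse them outright.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document xml = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xmpPacket));

            String part = property(xml, "part");
            String conformance = property(xml, "conformance");
            if (part == null)
            {
                return null;
            }
            return part + (conformance == null ? "" : conformance);
        }
        catch (ParserConfigurationException | SAXException | IOException e)
        {
            throw new IllegalArgumentException("Not a well-formed XMP packet", e);
        }
    }

    // Looks for the property as an element first, then as an rdf:Description attribute.
    private static String property(Document xml, String name)
    {
        NodeList elements = xml.getElementsByTagNameNS(PDFAID_NS, name);
        if (elements.getLength() > 0)
        {
            return elements.item(0).getTextContent().trim();
        }
        NodeList descriptions = xml.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++)
        {
            Element description = (Element) descriptions.item(i);
            if (description.hasAttributeNS(PDFAID_NS, name))
            {
                return description.getAttributeNS(PDFAID_NS, name).trim();
            }
        }
        return null;
    }
}
```

A result other than "3A", "3B" or "3U" after saving would suggest the identification was lost or altered. Note that this only checks what the file claims; whether the file actually conforms still has to be established with a PDF/A validator such as veraPDF.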

Madal Africa-Guinea · Answer 2 · 2017-10-20T10:29:15.730

How to create a valid PDF/A-{2,3} file with conformance level {B, U, A}: in this example I render a page of the source PDF to an image, then create a valid PDF/A-x-y from that image. Uses PDFBox 2.0.x.

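// Imports needed by the method below (PDFBox 2.0.x and XmpBox)
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Calendar;
import javax.xml.transform.TransformerException;
import org.apache.pdfbox.cos.COSDictionary;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDMarkInfo;
import org.apache.pdfbox.pdmodel.documentinterchange.logicalstructure.PDStructureTreeRoot;
import org.apache.pdfbox.pdmodel.graphics.color.PDOutputIntent;
import org.apache.pdfbox.pdmodel.graphics.image.LosslessFactory;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
import org.apache.pdfbox.pdmodel.interactive.viewerpreferences.PDViewerPreferences;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.DublinCoreSchema;
import org.apache.xmpbox.schema.PDFAIdentificationSchema;
import org.apache.xmpbox.type.BadFieldValueException;
import org.apache.xmpbox.xml.XmpSerializer;

public static void main(String[] args) throws IOException, TransformerException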
{

    String resultFile = "result/PDFA-x.PDF";  
    FileInputStream in = new FileInputStream("src/PDFOrigin.PDF");

    PDDocument doc = new PDDocument();
    try 
    {
        PDPage page = new PDPage();
        doc.addPage(page); 
        doc.setVersion(1.7f);

        /*             
        // A PDF/A file needs to have the font embedded if the font is used for text rendering
        // in rendering modes other than text rendering mode 3.
        //
        // This requirement includes the PDF standard fonts, so don't use their static PDFType1Font classes such as
        // PDFType1Font.HELVETICA.
        //
        // As there are many different font licenses it is up to the developer to check if the license terms for the
        // font loaded allows embedding in the PDF.

        String fontfile = "/org/apache/pdfbox/resources/ttf/ArialMT.ttf"; 
        PDFont font = PDType0Font.load(doc, new File(fontfile));           
        if (!font.isEmbedded())
        {
            throw new IllegalStateException("PDF/A compliance requires that all fonts used for"
                    + " text rendering in rendering modes other than rendering mode 3 are embedded.");
        }
      */ 

        // Render the first page of the source PDF and draw it as a full-page image.
        PDDocument docSource = PDDocument.load(in);
        PDPageContentStream contents = new PDPageContentStream(doc, page);
        try
        {
            PDFRenderer pdfRenderer = new PDFRenderer(docSource);
            int numPage = 0;

            BufferedImage imagePage = pdfRenderer.renderImageWithDPI(numPage, 200);
            PDImageXObject pdfXOImage = LosslessFactory.createFromImage(doc, imagePage);

            contents.drawImage(pdfXOImage, 0, 0, page.getMediaBox().getWidth(), page.getMediaBox().getHeight());
        }
        finally
        {
            // Close the stream and the source document even if rendering fails;
            // exceptions now propagate to the outer catch instead of being swallowed.
            contents.close();
            docSource.close();
        }

        // add XMP metadata
        XMPMetadata xmp = XMPMetadata.createXMPMetadata();
        PDDocumentCatalog catalogue = doc.getDocumentCatalog();
        Calendar cal =  Calendar.getInstance();          

        try
        {
            DublinCoreSchema dc = xmp.createAndAddDublinCoreSchema();
           // dc.setTitle(file);
            dc.addCreator("My APPLICATION Creator");
            dc.addDate(cal);

            PDFAIdentificationSchema id = xmp.createAndAddPFAIdentificationSchema();
            id.setPart(3);  //value => 2|3
            id.setConformance("A"); // value => A|B|U

            XmpSerializer serializer = new XmpSerializer();
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            serializer.serialize(xmp, baos, true);

            PDMetadata metadata = new PDMetadata(doc);
            metadata.importXMPMetadata(baos.toByteArray());                
            catalogue.setMetadata(metadata);
        }
        catch(BadFieldValueException e)
        {
            throw new IllegalArgumentException(e);
        }

        // sRGB output intent
        InputStream colorProfile = CreatePDFA.class.getResourceAsStream(
                "../../../pdmodel/sRGB.icc");
        PDOutputIntent intent = new PDOutputIntent(doc, colorProfile);
        intent.setInfo("sRGB IEC61966-2.1");
        intent.setOutputCondition("sRGB IEC61966-2.1");
        intent.setOutputConditionIdentifier("sRGB IEC61966-2.1");
        intent.setRegistryName("http://www.color.org");

        catalogue.addOutputIntent(intent);  
        catalogue.setLanguage("en-US");

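        // Viewer preferences need their own dictionary; wrapping the page dictionary
        // would write DisplayDocTitle into the page object instead.
        PDViewerPreferences pdViewer = new PDViewerPreferences(new COSDictionary());
        pdViewer.setDisplayDocTitle(true);
        catalogue.setViewerPreferences(pdViewer);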

        PDMarkInfo  mark = new PDMarkInfo(); // new PDMarkInfo(page.getCOSObject()); 
        PDStructureTreeRoot treeRoot = new PDStructureTreeRoot(); 
        catalogue.setMarkInfo(mark);
        catalogue.setStructureTreeRoot(treeRoot);           
        catalogue.getMarkInfo().setMarked(true);

        PDDocumentInformation info = doc.getDocumentInformation();               
        info.setCreationDate(cal);
        info.setModificationDate(cal);            
        info.setAuthor("My APPLICATION Author");
        info.setProducer("My APPLICATION Producer");
        info.setCreator("My APPLICATION Creator");
        info.setTitle("PDF title");
        info.setSubject("PDF to PDF/A{2,3}-{A,U,B}");           

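        doc.save(resultFile);
    }catch (Exception e) {
        throw new IllegalArgumentException(e);
    }
    finally
    {
        // Release the document and the input stream whether or not saving succeeded.
        doc.close();
        in.close();
    }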
}

The answer is probably OK (I would have to run the code through a validator to be sure), but decompressing a JPEG file is inefficient. Use `JPEGFactory.createFromStream()` instead; it uses the JPEG file as it is. It would be nice to change the code so that copy-and-paste users don't pick up that part. And if you still want to decode the JPEG to get a BufferedImage, only a single line is needed: ImageIO.read(). Your many lines are either outdated or very new :-) — Tilman Hausherr, Oct 18 '17 at 11:18
Here the objective is not to decompress a JPEG :). Otherwise you can directly produce BufferedImage from the PDF page with PDFRenderer.renderImageWithDPI (...) with PDFBOX. On the other hand the result has been validated by: pdf-online. — Madal Africa-Guinea, Oct 18 '17 at 13:56
I know that the objective is to produce a PDF. My remark is about the image in the PDF. Your usage of LosslessFactory with a jpeg file makes it slower (because it will decompress the jpeg and recompress it with Flate compression) and will usually produce bigger PDFs than if you use JPEGFactory with a stream input. — Tilman Hausherr, Oct 18 '17 at 14:34
I totally agree with you on this point. I extracted this code from a project that compresses and decompresses images in different formats. I wanted to keep it simple to understand, but only after posting did I realize that it would be even simpler to extract the image (BufferedImage) of the source PDF's page directly with PDFRenderer.renderImageWithDPI(numPage, dpi, ..). Please let me know once you have had time to test the validity of the resulting PDF/A-3A. — Madal Africa-Guinea, Oct 19 '17 at 13:41
With the latest version of PDFBox some of these methods no longer exist. See how to fix that at this link: https://www.programcreek.com/java-api-examples/?api=org.apache.pdfbox.cos.COSDocument — Buminda, Nov 19 '20 at 12:39
