8

I'm using iTextSharp to generate PDF/A documents from images. So far I've not been successful.

All I'm trying to do is make a PDF/A document (1a or 1b, whichever suits) containing some images. This is the code I've come up with so far, but I keep getting errors when I validate the output with pdf-tools or validatepdfa.

These are the errors I get from pdf-tools (using PDF/A-1b validation). Edit: MarkInfo and the color space aren't working yet; the rest is okay.

Validating file "0.pdf" for conformance level pdfa-1a
The key MarkInfo is required but missing.
A device-specific color space (DeviceRGB) without an appropriate output intent is used.
The document does not conform to the requested standard.
The document contains device-specific color spaces.
The document doesn't provide appropriate logical structure information.
Done.

Main flow

var output = new MemoryStream();
using (var iccProfileStream = new FileStream("ToPdfConverter/ColorProfiles/sRGB_v4_ICC_preference_displayclass.icc", FileMode.Open))
{
    var document = new Document(new Rectangle(PageSize.A4.Width, PageSize.A4.Height), 0f, 0f, 0f, 0f);
    var pdfWriter = PdfWriter.GetInstance(document, output);
    pdfWriter.PDFXConformance = PdfWriter.PDFA1A;
    document.Open();

    var pdfDictionary = new PdfDictionary(PdfName.OUTPUTINTENT);
    pdfDictionary.Put(PdfName.OUTPUTCONDITION, new PdfString("sRGB IEC61966-2.1"));
    pdfDictionary.Put(PdfName.INFO, new PdfString("sRGB IEC61966-2.1"));
    pdfDictionary.Put(PdfName.S, PdfName.GTS_PDFA1);

    var iccProfile = ICC_Profile.GetInstance(iccProfileStream);
    var pdfIccBased = new PdfICCBased(iccProfile);
    pdfIccBased.Remove(PdfName.ALTERNATE);
    pdfDictionary.Put(PdfName.DESTOUTPUTPROFILE, pdfWriter.AddToBody(pdfIccBased).IndirectReference);

    pdfWriter.ExtraCatalog.Put(PdfName.OUTPUTINTENT, new PdfArray(pdfDictionary));

    var image = PrepareImage(imageBytes);

    document.Open();
    document.Add(image);

    pdfWriter.CreateXmpMetadata();

    pdfWriter.CloseStream = false;
    document.Close();
}
return output.ToArray(); // ToArray copies only the bytes written; GetBuffer also returns unused trailing capacity

This is PrepareImage().
It flattens the image to BMP, so I don't have to worry about alpha channels.

private Image PrepareImage(Stream stream)
{
    Bitmap bmp = new Bitmap(System.Drawing.Image.FromStream(stream));
    var file = new MemoryStream();
    bmp.Save(file, ImageFormat.Bmp);
    var image = Image.GetInstance(file.ToArray()); // ToArray returns only the bytes actually written

    if (image.Height > PageSize.A4.Height || image.Width > PageSize.A4.Width)
    {
        image.ScaleToFit(PageSize.A4.Width, PageSize.A4.Height);
    }
    return image;
}

Can anyone point me in the right direction to fix these errors? Specifically the one about device-specific color spaces.

Edit: More explanation: what I'm trying to achieve is converting scanned images to PDF/A for long-term data storage.

Edit: added some files I'm using to test with
PDFs and Pictures.rar (3.9 MB)
https://mega.co.nz/#!n8pClYgL!NJOJqSO3EuVrqLVyh3c43yW-u_U35NqeB0svc6giaSQ

Filburt
Highmastdon
  • It might be worth raising a bug with the iText people. – Rup Apr 09 '13 at 08:49
  • Why do you set conformance level to PDF/A-1a and then check against 1b? It would be good to be consistent. Also, why do you open the document twice? Also, I would try to resolve the other errors first - the errors you have with file structure being corrupted and so on, could easily interfere with the (lesser) problem you have with color spaces... – David van Driessche Apr 09 '13 at 08:57
  • @David Okay, thanks for your reply. I've already got almost everything working correctly now; only the `color space` is still wrong. I've added some edits to the code. – Highmastdon Apr 09 '13 at 09:51
  • What's the color space of the image you are inserting? And could you share an example PDF? That way I could run it through the pdfToolbox PDF/A verification and perhaps have a more descriptive error message. – David van Driessche Apr 09 '13 at 10:11
  • What we're trying to do is convert scanned images to PDF/A for long-term data storage. I've uploaded a zip with the files I'm using for testing: PDFs and Pictures.rar (3.9 MB) https://mega.co.nz/#!n8pClYgL!NJOJqSO3EuVrqLVyh3c43yW-u_U35NqeB0svc6giaSQ – Highmastdon Apr 09 '13 at 10:16
  • Have you tried using the PdfAWriter? Since last summer the PDF/A specific functionality has been moved into separate classes. In Java they even have been moved to a separate jar, maybe it is the same with iTextSharp? – mkl Apr 09 '13 at 12:39

2 Answers

1

OK, I checked one of your files in callas pdfToolbox and it says: "Device color space used but no PDF/A output intent". I took that as a sign that something goes wrong when you write the output intent to the document. I then converted that document to PDF/A-1b with the same tool, and the difference is obvious.

Perhaps there are other errors you need to fix, but the first error here is that you put a key in the catalog dict for the PDF file that is named "OutputIntent". That's wrong: page 75 of the PDF Specification states that the key should be named "OutputIntents".

Like I said, perhaps there are other problems with your file beyond this, but the wrong name for the key causes PDF/A validators not to find the Output Intent you try to put in the file...
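The key fix described above can be sketched as follows. This is only an illustration: `OutputIntentFix` and `RegisterOutputIntent` are hypothetical names, the dictionary is assumed to be built as in the question's code, and it relies on the `PdfName.OUTPUTINTENTS` constant from iTextSharp 5.x. It has not been run against the uploaded files, so other validation errors may remain.

```csharp
using iTextSharp.text.pdf;

static class OutputIntentFix
{
    // Hypothetical helper: registers an already-built output intent dictionary
    // under the catalog key that PDF/A validators look for.
    public static void RegisterOutputIntent(PdfWriter pdfWriter, PdfDictionary outputIntent)
    {
        // The question used PdfName.OUTPUTINTENT, which writes /OutputIntent (singular).
        // The catalog key must be /OutputIntents, holding an array of output intent dictionaries.
        pdfWriter.ExtraCatalog.Put(PdfName.OUTPUTINTENTS, new PdfArray(outputIntent));
    }
}
```

In the question's main flow this would replace the `pdfWriter.ExtraCatalog.Put(PdfName.OUTPUTINTENT, ...)` line, for example `OutputIntentFix.RegisterOutputIntent(pdfWriter, pdfDictionary);`.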

David van Driessche
  • +1; if @Highmastdon had used the method `PdfWriter.SetOutputIntents` instead, the correct name would have been used... If he had used `PdfAWriter` instead of `PdfWriter`, some more stuff would have automatically been taken care of. – mkl Apr 10 '13 at 07:39
0
  1. First of all, PDF/X is not PDF/A.
  2. Second, you're using the wrong writer class. It should be PdfAWriter, not PdfWriter.

Unfortunately I don't have a solution for the image problem, but I do have one for points 1 and 2.

Regards

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using System.Text;
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.html.simpleparser;
using iTextSharp.tool.xml;
using System.Drawing;
using System.Drawing.Imaging;

namespace Tests
{
    /*
     * References:  
     * UTF-8 encoding http://stackoverflow.com/questions/4902033/itextsharp-5-polish-character
     * PDFA http://www.codeproject.com/Questions/661704/Create-pdf-A-using-itextsharp
     * Images http://stackoverflow.com/questions/15896581/make-a-pdf-conforming-pdf-a-with-only-images-using-itextsharp
     */

    [TestClass]
    public class UnitTest1
    {
        /*
         * IMPORTANT: restrictions on HTML tags and attributes
         * 1. Don't use <head><title>Sklep</title></head>, because the title is rendered onto the page
         */

        // Test cases
        static string contents = "<html><body style=\"font-family:arial unicode ms;font-size: 8px;\"><p style=\"text-align: center;\"> Davčna številka dolžnika: 74605968<br /> </p><table> <tr> <td><b>\u0160t. sklepa: 88711501</b></td> <td style=\"text-align: right;\">Davčna številka dolžnika: 74605968</td> </tr> </table> <br/><img src=\"http://img.rtvslo.si/_static/images/rtvslo_mmc_logo.png\" /></body></html>";
        //static string contents = "<html><body style=\"font-family:arial unicode ms;font-size: 8px;\"><p style=\"text-align: center;\"> Davčna številka dolžnika: 74605968<br /> </p><table> <tr> <td><b>\u0160t. sklepa: 88711501</b></td> <td style=\"text-align: right;\">Davčna številka dolžnika: 74605968</td> </tr> </table> <br/></body></html>";

        //[TestMethod]
        public void CreatePdfHtml()
        {
            createPDF(contents, true);        
        }

        private void createPDF(string html, bool isPdfa)
        {
            TextReader reader = new StringReader(html);
            Document document = new Document(PageSize.A4, 30, 30, 30, 30);
            HTMLWorker worker = new HTMLWorker(document);

            PdfWriter writer;
            if (isPdfa)
            {
                //set conformity level
                writer = PdfAWriter.GetInstance(document, new FileStream(@"c:\temp\testA.pdf", FileMode.Create), PdfAConformanceLevel.PDF_A_1B);

                //set pdf version
                writer.SetPdfVersion(PdfAWriter.PDF_VERSION_1_4);

                // Create XMP metadata. It's a PDF/A requirement.
                writer.CreateXmpMetadata();
            }
            else
            {
                writer = PdfWriter.GetInstance(document, new FileStream(@"c:\temp\test.pdf", FileMode.Create));
            }

            document.Open();

            if (isPdfa) // the document must already be open, or this will fail
            {
                // Set output intent for uncalibrated color space. PDF/A requirement.
                ICC_Profile icc = ICC_Profile.GetInstance(Environment.GetEnvironmentVariable("SystemRoot") +  @"\System32\spool\drivers\color\sRGB Color Space Profile.icm");
                writer.SetOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);
            }

            //register font used in html
            FontFactory.Register(Environment.GetEnvironmentVariable("SystemRoot") + "\\Fonts\\ARIALUNI.TTF", "arial unicode ms");

            //add custom style attributes for HTML-specific tasks; can be used instead of CSS
            //this one is required to display UTF-8 language-specific characters (čćžđpš)
            iTextSharp.text.html.simpleparser.StyleSheet ST = new iTextSharp.text.html.simpleparser.StyleSheet();
            ST.LoadTagStyle("body", "encoding", "Identity-H");
            worker.SetStyleSheet(ST);

            worker.StartDocument();
            worker.Parse(reader);
            worker.EndDocument();
            worker.Close();
            document.Close();
        }

    }


}
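For the image-only case from the question, the same calls might be combined roughly as below. This is an untested sketch, not a verified solution to the image problem: `ImagePdfASketch` and `ImageToPdfA` are hypothetical names, the ICC profile path is supplied by the caller, and the image bytes are assumed to be already flattened (no alpha channel), as the question's PrepareImage() does. Only calls that already appear in the code above or in the question are used.

```csharp
using System.IO;
using iTextSharp.text;
using iTextSharp.text.pdf;

static class ImagePdfASketch
{
    // Hypothetical helper: wraps one image in a single-page PDF/A-1b document.
    // imageBytes is assumed to contain an image without an alpha channel.
    public static byte[] ImageToPdfA(byte[] imageBytes, string iccProfilePath)
    {
        using (var output = new MemoryStream())
        {
            var document = new Document(PageSize.A4, 0f, 0f, 0f, 0f);

            // PdfAWriter instead of PdfWriter, with the conformance level set up front.
            PdfWriter writer = PdfAWriter.GetInstance(document, output, PdfAConformanceLevel.PDF_A_1B);
            writer.CreateXmpMetadata();

            // Open the document exactly once, before setting the output intent.
            document.Open();

            // SetOutputIntents writes the catalog entry under the correct key name.
            ICC_Profile icc = ICC_Profile.GetInstance(iccProfilePath);
            writer.SetOutputIntents("Custom", "", "http://www.color.org", "sRGB IEC61966-2.1", icc);

            // Scale the image down only if it is larger than the page.
            Image image = Image.GetInstance(imageBytes);
            if (image.Height > PageSize.A4.Height || image.Width > PageSize.A4.Width)
            {
                image.ScaleToFit(PageSize.A4.Width, PageSize.A4.Height);
            }
            document.Add(image);

            document.Close();

            // ToArray returns only the bytes written and still works after the stream is closed.
            return output.ToArray();
        }
    }
}
```

The result would still need to be checked with a PDF/A validator, since the image's own color space can trigger further errors.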
Mitja Gustin