Convert a PDF with forms to a PDF with text only (preserve data) using iText

Question

I have multiple PDFs (a.pdf, b.pdf, c[0-9].pdf, d[0-9].pdf, ez.pdf) that get populated with multiple records using AcroForms and PDFBox.
The resulting files (aflat.pdf, bflat.pdf, c[0-9]flat.pdf, d[0-9]flat.pdf, ezflat.pdf) should have their forms (the dictionaries and whatever else Adobe uses) removed, with the filled-in field values kept as plain text on the page (setReadOnly is not what I want!).

PdfStamper can only remove fields without saving their content, but I've found some references to PdfContentByte as a way to save the content. Alas, the documentation is too brief for me to understand how to do this.

As a last resort I could use FieldPosition to write directly on the PDF. Has anyone encountered this problem? How do I solve it?

UPDATE: Saving a single page of b.pdf yields a valid bfilled.pdf but a blank bflattened.pdf. Saving the whole document solved the issue.

    populateB();
    try (PDDocument doc = new PDDocument(); FileOutputStream stream = new FileOutputStream("bfilled.pdf")) {
        //importing the page will corrupt the fields
        /*wrong approach*/doc.importPage((PDPage)pdfDocuments.get(0).getDocumentCatalog().getAllPages().get(0));
        /*wrong approach*/doc.save(stream);
        //save the whole document instead
        pdfDocuments.get(0).save(stream);//<---right approach

    }
    try (FileOutputStream stream = new FileOutputStream("bflattened.pdf")) {
        PdfStamper stamper = new PdfStamper(new PdfReader("bfilled.pdf"), stream);
        stamper.setFormFlattening(true);
        stamper.close();
    }

You say your PDFs were *populated using acroforms and pdfbox*. Have appearance streams been created for those fields by PDFBox? If not, you might want to read [this answer](http://stackoverflow.com/a/20527155/1729265) by Bruno. – mkl, Feb 24 '15 at 13:54
The documentation says: "Sets the option to generate appearances. Not generating appearances will speed-up form filling but the results can be unexpected in Acrobat. Don't use it unless your environment is well controlled. The default is true." – Gabriele Frau, Feb 24 '15 at 14:04

score 3 · Answer 1 · answered Feb 24 '15 at 11:34

Use PdfStamper.setFormFlattening(true) to get rid of the fields and write them as content.

Paulo Soares


It doesn't save the content of the field. There must be some steps I'm missing. Would you care to elaborate? :) P.S. Should I use pdfStamper.getOverContent(pageNum) to get the content and then do something? – Gabriele Frau Feb 24 '15 at 11:43
Can you post the pdf? – Paulo Soares Feb 24 '15 at 13:35
The pdf is as standard as they get, there shouldn't be a problem flattening it. Show me your flattening code. – Paulo Soares Feb 24 '15 at 14:18
I tried it and `PdfStamper` with `setFormFlattening(true)` flattens the field contents into the page content as expected. If it doesn't for you, you should edit your question to include enough code to allow reproducing the issue. – mkl Feb 24 '15 at 14:19

score 1 · Answer 2 · answered Feb 24 '15 at 15:14

Always save the whole document, not a single imported page, when working with AcroForms.

    populateB();
    try (FileOutputStream stream = new FileOutputStream("bfilled.pdf")) {
        // Wrong approach: importing a single page into a new document corrupts
        // the fields (the AcroForm lives in the document catalog, not on the page):
        //   PDDocument doc = new PDDocument();
        //   doc.importPage((PDPage) pdfDocuments.get(0).getDocumentCatalog().getAllPages().get(0));
        //   doc.save(stream);
        // Right approach: save the whole document instead.
        pdfDocuments.get(0).save(stream);
    }
    try (FileOutputStream stream = new FileOutputStream("bflattened.pdf")) {
        PdfReader reader = new PdfReader("bfilled.pdf");
        PdfStamper stamper = new PdfStamper(reader, stream);
        // Write the field values into the page content and drop the form fields.
        stamper.setFormFlattening(true);
        stamper.close();
        reader.close();
    }
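
Only one save call should ever write to a given output stream. The following is a minimal sketch of what happens when two documents are written to the same stream. It uses plain strings as hypothetical stand-ins for the two PDDocument objects, the `save` helper is invented for illustration and is not a PDFBox call, and it assumes the stream stays open between the two writes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class SingleSaveDemo {
    // Hypothetical stand-in for a library "save" call: writes one complete document.
    static void save(String document, OutputStream out) throws IOException {
        out.write(document.getBytes(StandardCharsets.US_ASCII));
    }

    // Writes both documents to the same stream and returns everything the stream received.
    static byte[] saveBoth(String first, String second) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            save(first, out);   // stand-in for the single-page document
            save(second, out);  // stand-in for the whole document
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] result = saveBoth("%PDF-page-only%%EOF", "%PDF-whole-document%%EOF");
        // The stream now holds both payloads back to back, not just the second one.
        System.out.println(new String(result, StandardCharsets.US_ASCII));
    }
}
```

The second write does not replace the first. The stream ends up with both payloads concatenated, which is not a single well-formed file, so the unwanted save has to be removed rather than merely followed by the correct one.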

Gabriele Frau


So this essentially was a problem in the context of PDFBox use, not iText, wasn't it? BTW, your code as shown here now saves two distinct documents to the same output stream. I assume your comments in that code mean not to use the first save but the second one. But there are so many people around who copy&paste code without reading comments... – mkl Feb 24 '15 at 15:28
Essentially yes. I assumed PDPage had all the information needed. I left the wrong instructions uncommented to keep them more readable; I'll add a little comment on the side just to be sure. – Gabriele Frau Feb 24 '15 at 16:01
