Extracting answers to a flattened PDF form with iText 7

Question

We have a few forms created with Adobe LiveCycle. Users fill in the dynamic forms and submit the document to our office, where we stamp it with our signature and flatten it. (At least most of the time: I've seen a few documents in our system that haven't been flattened, but that can be a separate question. I'll focus on the flattened documents here because that's most of what we have.)

I'm trying to use iText 7 to parse/extract the users' answers from our forms so that we can migrate them to an electronic solution a few months from now. I was able to get the example below working in Java, but I don't understand the process.

/*
    This file is part of the iText (R) project.
    Copyright (c) 1998-2020 iText Group NV
    Authors: iText Software.
 
    For more information, please contact iText Software at this address:
    sales@itextpdf.com
 */
/**
 * Example written by Bruno Lowagie in answer to:
 * http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
 */
package ca.umanitoba.ad.research;
 
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.EventType;
import com.itextpdf.kernel.pdf.canvas.parser.PdfCanvasProcessor;
import com.itextpdf.kernel.pdf.canvas.parser.data.IEventData;
import com.itextpdf.kernel.pdf.canvas.parser.data.TextRenderInfo;
import com.itextpdf.kernel.pdf.canvas.parser.filter.TextRegionEventFilter;
import com.itextpdf.kernel.pdf.canvas.parser.listener.FilteredEventListener;
import com.itextpdf.kernel.pdf.canvas.parser.listener.LocationTextExtractionStrategy;
 
import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.FileOutputStream;
import java.io.Writer;
import java.io.BufferedWriter;
 
public class Main {
    public static final String DEST = "./target/txt/parse_custom.txt";
 
    public static final String SRC = "./src/main/resources/pdfs/nameddestinations.pdf";
 
    public static void main(String[] args) throws IOException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
 
        new Main().manipulatePdf(DEST);
    }
 
    protected void manipulatePdf(String dest) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(SRC));
 
        Rectangle rect = new Rectangle(36, 750, 523, 56);
        CustomFontFilter fontFilter = new CustomFontFilter(rect);
        FilteredEventListener listener = new FilteredEventListener();
 
        // Create a text extraction renderer
        LocationTextExtractionStrategy extractionStrategy = listener
                .attachEventListener(new LocationTextExtractionStrategy(), fontFilter);
 
        // Note: If you want to re-use the PdfCanvasProcessor, you must call PdfCanvasProcessor.reset()
        PdfCanvasProcessor parser = new PdfCanvasProcessor(listener);
        parser.processPageContent(pdfDoc.getFirstPage());
 
        // Get the resultant text after applying the custom filter
        String actualText = extractionStrategy.getResultantText();
 
        pdfDoc.close();
 
        // See the resultant text in the console
        System.out.println(actualText);
 
        try (Writer writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(dest)))) {
            writer.write(actualText);
        }
    }
 
    /*
     * The custom filter filters only the text of which the font name ends with Bold or Oblique.
     */
    protected class CustomFontFilter extends TextRegionEventFilter {
        public CustomFontFilter(Rectangle filterRect) {
            super(filterRect);
        }
 
        @Override
        public boolean accept(IEventData data, EventType type) {
            if (type.equals(EventType.RENDER_TEXT)) {
                TextRenderInfo renderInfo = (TextRenderInfo) data;
                PdfFont font = renderInfo.getFont();
                if (null != font) {
                    String fontName = font.getFontProgram().getFontNames().getFontName();
                    return fontName.endsWith("Bold") || fontName.endsWith("Oblique");
                }
            }
 
            return false;
        }
    }
}

Why is there a need to specify a Rectangle? Our forms are dynamic, so users can add more fields as needed, and some questions accept whole paragraphs as answers. The length of the content therefore always varies, and the coordinates of the text are unlikely to be the same from one document to the next.

How can I change the flow so that I can perhaps just search for the question and then get the text right after it (presumably the answer)? I don't really know what the best way to parse a PDF is. If there's no other way except providing a Rectangle, can I programmatically determine the coordinates/dimensions of the rectangles?

The example appears to filter the text based on whether the font is bold or oblique, which I probably don't need, but that looks easy enough to change by modifying or removing the accept() method.

Can you attach the example PDF? There are some strategies in iText out of the box, and the possible flow depends on the document structure. P.S. You can have a look at RegexBasedLocationExtractionStrategy. It returns rectangles that can be used in the location strategy — Pavel Chermyanin, May 21 '20 at 02:04
@PavelChermyanin I don't know if I'm able to attach our forms, as they contain some contact information. I have been using [`PdfAcroForm.getFormFields()`](https://api.itextpdf.com/iText7/java/7.1.7/com/itextpdf/forms/PdfAcroForm.html#getFormFields--) but nothing is returned. I'm thinking that's perhaps because the fields are not under a form element, much like how it should be in HTML? — dokgu, May 21 '20 at 14:45
After flattening, a PDF file does not contain an AcroForm anymore. Of course you can extract field values if a PDF has not been flattened yet, but as you mentioned, that is a separate question — Pavel Chermyanin, May 21 '20 at 15:58

score 0 · Answer 1 · edited May 25 '20 at 20:58

Please take a look at what that example is for: In the JavaDoc comment you can read

/**
 * Example written by Bruno Lowagie in answer to:
 * http://stackoverflow.com/questions/24506830/can-we-use-text-extraction-strategy-after-applying-location-extraction-strategy
 */

and that Stack Overflow question starts with

I used the following code to get data in PDF from a particular location. I want to get bold text present in that location

When you wonder, therefore,

Why is there a need to specify a Rectangle?

the answer is: because the example is about finding bold text in a particular location.

You mention your forms were dynamic before flattening and fields don't have fixed positions. Thus, this filter probably is not optimal for your use case.

How can I change the flow so that I can perhaps just search for the question and then get the text right after it

In that case simply don't filter at all but use a plain LocationTextExtractionStrategy to extract text, search for the question text in the extracted text, and use the text thereafter up to the next question text.
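As a rough sketch of the "search for the question, take the text up to the next question" step: the helper below works on text that has already been extracted, for example with `PdfTextExtractor.getTextFromPage(page, new LocationTextExtractionStrategy())`. The class name `AnswerSlicer`, the method `sliceAnswers`, and the sample question labels are made up for illustration, and it assumes each question label appears verbatim (and unbroken) in the extracted text, which may not hold if a label wraps across lines.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AnswerSlicer {

    /**
     * Splits extracted text into question -> answer pairs.
     * An answer is the text between a question label and the nearest
     * question label that follows it (or the end of the text).
     */
    static Map<String, String> sliceAnswers(String text, List<String> questions) {
        Map<String, String> answers = new LinkedHashMap<>();
        for (String question : questions) {
            int start = text.indexOf(question);
            if (start < 0) {
                continue; // label not present in this text
            }
            int answerStart = start + question.length();
            int answerEnd = text.length();
            // Find the closest label occurring after the current one.
            for (String other : questions) {
                int pos = text.indexOf(other, answerStart);
                if (pos >= 0 && pos < answerEnd) {
                    answerEnd = pos;
                }
            }
            answers.put(question, text.substring(answerStart, answerEnd).trim());
        }
        return answers;
    }

    public static void main(String[] args) {
        // Hypothetical sample; real input would be the text extracted by iText.
        String text = "Full name:\nJane Doe\nProject summary:\nA study of\nriver ice.\n";
        List<String> questions = Arrays.asList("Full name:", "Project summary:");
        sliceAnswers(text, questions)
                .forEach((q, a) -> System.out.println(q + " -> " + a));
    }
}
```

For multi-page forms, the text of all pages would probably need to be concatenated first, since an answer can continue on the next page.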

Alternatively, if you still have the unflattened dynamic forms, you may consider extracting the XFA XML and reading the filled-in data from that XML.
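To illustrate the XFA route, here is a minimal sketch that walks an XML node and collects every leaf element as a name/value pair. In iText 7 the starting node could presumably be obtained via `PdfAcroForm.getAcroForm(pdfDoc, false).getXfaForm().getDatasetsNode()`; please check that against the iText version in use. The class name `XfaDataReader`, the method `collectLeafValues`, and the sample XML (including its element names) are invented for this example and do not reflect the actual structure of your forms.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class XfaDataReader {

    /** Recursively collects elements that have no child elements as name -> text. */
    static void collectLeafValues(Node node, Map<String, String> out) {
        boolean hasChildElement = false;
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                hasChildElement = true;
                collectLeafValues(child, out);
            }
        }
        if (!hasChildElement && node.getNodeType() == Node.ELEMENT_NODE) {
            out.put(node.getNodeName(), node.getTextContent().trim());
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical datasets XML standing in for what the XFA form would contain.
        String xml = "<xfa:datasets xmlns:xfa=\"http://www.xfa.org/schema/xfa-data/1.0/\">"
                + "<xfa:data><form1><FullName>Jane Doe</FullName>"
                + "<Summary>A study of river ice.</Summary></form1></xfa:data>"
                + "</xfa:datasets>";
        Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        Map<String, String> values = new LinkedHashMap<>();
        collectLeafValues(root, values);
        values.forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Note that a map keyed by element name keeps only the last value when a name repeats, which is likely in dynamic forms with repeatable rows; a list of pairs, or keys built from the full element path, would avoid that.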

Thanks. I've been thinking about this and instead of searching for the question and getting the answer next to it, I think the best solution is to use the name of the input fields. I'm currently looking at [`Element.select()`](https://api.itextpdf.com/iText7/java/7.1.8/com/itextpdf/styledxmlparser/jsoup/nodes/Element.html#select-java.lang.String-) but haven't made any progress yet. This approach is what I'm more familiar with because it's like Javascript. Any pointers? — dokgu, May 21 '20 at 14:43
`Element.select()` is from the styled XML parser. It is used in particular in html-to-pdf and svg-to-pdf contexts. To what do you apply it? — mkl, May 26 '20 at 04:46
