How to extract a diagonal watermark from a PDF using PDFBox, and extract text while maintaining alignment

Question

How can I extract diagonal watermark text from a PDF using PDFBox?

Following the rotationMagic option of PDFBox's ExtractText tool, I can now extract vertical and horizontal watermarks, but not diagonal ones. This is my code so far.

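import java.io.File;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;
import org.apache.pdfbox.util.Matrix;

// Collects the distinct text angles (0..359 degrees) found on a page; writes no text itself.
class AngleCollector extends PDFTextStripper {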
    private final Set<Integer> angles = new TreeSet<>();

    AngleCollector() throws IOException {}

    Set<Integer> getAngles() {
        return angles;
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        angle = (angle + 360) % 360;
        angles.add(angle);
    }
}

class FilteredTextStripper extends PDFTextStripper {
    FilteredTextStripper() throws IOException {
    }

    @Override
    protected void processTextPosition(TextPosition text) {
        int angle = ExtractText.getAngle(text);
        if (angle == 0) {
            super.processTextPosition(text);
        }
    }
}

final class ExtractText {
    static int getAngle(TextPosition text) {
        //The Matrix containing the starting text position
        Matrix m = text.getTextMatrix().clone();
        m.concatenate(text.getFont().getFontMatrix());
        return (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
    }

    private List<String> getAnnots(PDPage page) throws IOException {
        List<String> returnList = new ArrayList<>();
        for (PDAnnotation pdAnnot : page.getAnnotations()) {
            if (pdAnnot.getContents() != null && !pdAnnot.getContents().isEmpty()) {
                returnList.add(pdAnnot.getContents());
            }
        }
        return returnList;
    }

    public void extractPages(int startPage, int endPage, PDFTextStripper stripper, PDDocument document, Writer output) {
        for (int p = startPage; p <= endPage; ++p) {
            stripper.setStartPage(p);
            stripper.setEndPage(p);
            try {

                PDPage page = document.getPage(p - 1);
                for (var annot : getAnnots(page)) {
                    output.write(annot);
                }

                int rotation = page.getRotation();
                page.setRotation(0);
                var angleCollector = new AngleCollector();
                angleCollector.setStartPage(p);
                angleCollector.setEndPage(p);
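                // Only the angles are needed here, so discard this pass's output (Writer.nullWriter() needs Java 11+).
                angleCollector.writeText(document, Writer.nullWriter());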

                for (int angle : angleCollector.getAngles()) {
                    // prepend a transformation

                    try (var cs = new PDPageContentStream(document, page,
                            PDPageContentStream.AppendMode.PREPEND, false)) {
                        cs.transform(Matrix.getRotateInstance(-Math.toRadians(angle), 0, 0));
                    }

                    stripper.writeText(document, output);

                    // remove prepended transformation
                    ((COSArray) page.getCOSObject().getItem(COSName.CONTENTS)).remove(0);
                }
                page.setRotation(rotation);

            } catch (IOException ex) {
                System.err.println("Failed to process page " + p + ": " + ex);
            }
        }
    }
}

public class pdfTest {
    private pdfTest() {
    }

    public static void main(String[] args) throws IOException {
        var pdfFile = "test-resources/pdf/pdf_sample_2.pdf";
        Writer output = new OutputStreamWriter(System.out, StandardCharsets.UTF_8);
        var etObj = new ExtractText();
        // Close the document when done to release the underlying file.
        try (var rawDoc = PDDocument.load(new File(pdfFile))) {
            PDFTextStripper stripper = new FilteredTextStripper();

            // Flatten AcroForm fields so their values become part of the page content.
            if (rawDoc.getDocumentCatalog().getAcroForm() != null) {
                rawDoc.getDocumentCatalog().getAcroForm().flatten();
            }

            etObj.extractPages(1, rawDoc.getNumberOfPages(), stripper, rawDoc, output);
            output.flush();
        }
    }
}

Edit 1: I am also unable to extract form (AcroForm, XFA) field contents with correct alignment using my text extraction code. How can I do that?

I am attaching sample PDFs for reference: Sample PDF 1, Sample PDF 2.

I require the following using PDFBox:

Diagonal text detection (including watermarks).
Form field extraction while maintaining proper alignment.

Note: I am not looking for image watermarks, just plain text watermarks. — Sahib Yar, Dec 02 '21 at 16:23
Get the source code of ExtractText. There is an option called "rotationMagic" which separates texts by angles. — Tilman Hausherr, Dec 03 '21 at 03:42
Please share the PDF. Maybe the watermark was done with vector graphics. — Tilman Hausherr, Dec 04 '21 at 10:22
The XFA / form fields thing is a different question. I ran ExtractText with `-rotationMagic` on `double_watermark.pdf` and it got two watermarks but not the 45° one, because this one is a vector graphic. Try marking it in Adobe Reader, you won't be able to. — Tilman Hausherr, Dec 07 '21 at 13:13
Since it is a vector graphic, I don't see a deterministic solution. The simplest approach would be to discard objects lighter than a specific color, but that is far from a good and predictable solution. A more complex approach would involve interpreting the shapes as text, which is hard. PDF interpretation is hell if you have no way of knowing the input format beforehand in a more predictable manner. — brunoff, Dec 07 '21 at 19:22

mkl · Answer 1 · 2021-12-08T10:56:45.047

In your "question" you actually ask multiple distinct questions. I'll look into each of them. The answers will be less specific than you'd probably wish because your questions are based on assumptions that are not all true.

"How can I extract diagonal watermark text from PDF using PDFBox ?"

First of all, PDF text extraction works by inspecting the instructions in content streams of a page and contained XObjects, finding text drawing instructions therein, taking the coordinates and orientations and the string parameters thereof, mapping the strings to Unicode, and arranging the many individual Unicode strings by their coordinates and orientations in a single content string.

In PDFBox, PDFTextStripper as-is does this with only limited support for orientation, but it can be extended to filter the text pieces by orientation, as shown in the ExtractText example with rotation magic activated.
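
The orientation test used in the question's getAngle boils down to a small piece of arithmetic on two text matrix entries. The following is a minimal, stand-alone sketch of that arithmetic, assuming a pure rotation matrix as input; the class name TextAngleSketch and the sample angles are made up for illustration and are not part of PDFBox.

```java
import java.util.Set;
import java.util.TreeSet;

/** Stand-alone sketch of the angle arithmetic; no PDFBox classes are needed here. */
final class TextAngleSketch {

    /**
     * Returns the angle in whole degrees, normalised to 0..359, computed from the
     * two matrix entries that PDFBox exposes as Matrix.getShearY() and Matrix.getScaleY().
     */
    static int angleOf(double shearY, double scaleY) {
        int angle = (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
        return (angle + 360) % 360;
    }

    public static void main(String[] args) {
        Set<Integer> angles = new TreeSet<>();
        // Hypothetical sample rotations; a real run would take them from TextPosition objects.
        for (int degrees : new int[] {0, 45, 90, -90}) {
            double radians = Math.toRadians(degrees);
            // For a pure rotation, shearY is the sine and scaleY the cosine of the angle.
            angles.add(angleOf(Math.sin(radians), Math.cos(radians)));
        }
        System.out.println(angles);
    }
}
```

A diagonal text run would show up here as an angle such as 45, and the rotation magic then extracts the text once per collected angle, so a missing angle means no text drawing instruction with that orientation exists on the page.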

double_watermark.pdf

In the case of your double_watermark.pdf example PDF, though, the diagonal text "Top Secret" is not created using text drawing instructions but using path construction and painting instructions, as Tilman already remarked. (Actually, the paths here are all sequences of very short lines; no curves are used, which you can see at a high zoom factor.) Thus, PDF text extraction cannot extract this text.

To answer your question

How can I extract diagonal watermark text from a PDF using PDFBox?

in this context, therefore: you cannot.

(You can of course use PDFBox as a PDF processing framework on top of which you collect the paths and try to match them to characters, but that would be a larger project in its own right. Or you can use PDFBox to draw the pages as bitmaps and apply OCR to those bitmaps.)
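
The bitmap route could start out roughly as follows. This is a sketch assuming PDFBox 2.x on the classpath (PDDocument.load, PDFRenderer); the class name RenderForOcrSketch, the 300 DPI setting, and the output file names are arbitrary choices for illustration, and the OCR step itself is left to whatever external tool you pick.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.ImageType;
import org.apache.pdfbox.rendering.PDFRenderer;

/** Sketch: render each page to a grayscale bitmap that an OCR tool can read. */
final class RenderForOcrSketch {

    /** Renders one page (0-based index) at the given resolution. */
    static BufferedImage renderPage(PDDocument document, int pageIndex, float dpi) throws IOException {
        PDFRenderer renderer = new PDFRenderer(document);
        return renderer.renderImageWithDPI(pageIndex, dpi, ImageType.GRAY);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to render.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            for (int i = 0; i < document.getNumberOfPages(); i++) {
                BufferedImage image = renderPage(document, i, 300f);
                // Assumed naming scheme: page-1.png, page-2.png, ... in the working directory.
                ImageIO.write(image, "png", new File("page-" + (i + 1) + ".png"));
            }
        }
    }
}
```

Rendering draws vector paths just like real text, so the watermark ends up in the image; whether an OCR engine then recognises pale, rotated letters is a separate matter that has to be tested on the actual files.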

"I am also unable to detect form (Acro, XFA) field contents via TextExtractor code with correct Alignment. How can I do that ?"

Form data in AcroForm or XFA form definitions is not part of the page content streams or of the XObject content streams referenced from them. Thus, it is not immediately subject to text extraction.

AcroForm forms

AcroForm form fields are abstract PDF data objects which may or may not have associated content streams for display. To include them in the content streams that text extraction operates on, you can first flatten the form. As you mentioned in your own answer, you also have to activate sorting to extract the field contents in context.

Beware: PDF renderers have certain freedoms when creating the visualization of a form field. Thus, the text extraction order may differ slightly from what you expect.
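
Put together, flattening plus sorted extraction might look like the sketch below. It assumes PDFBox 2.x; the class name FlattenedFormTextSketch is made up for illustration. Note that flattening changes the in-memory document, so do not save it over the original unless that is intended.

```java
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.text.PDFTextStripper;

/** Sketch: make AcroForm field values visible to text extraction. */
final class FlattenedFormTextSketch {

    static String extractWithFormValues(PDDocument document) throws IOException {
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        if (acroForm != null) {
            // Moves the field appearances into the page content (modifies the document in memory).
            acroForm.flatten();
        }
        PDFTextStripper stripper = new PDFTextStripper();
        // Order text by position on the page so field values appear next to their labels.
        stripper.setSortByPosition(true);
        return stripper.getText(document);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF containing the form.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            System.out.println(extractWithFormValues(document));
        }
    }
}
```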

XFA forms

XFA form definitions are a cuckoo's egg in PDF. They are XML streams which are not related to regular PDF objects; furthermore, XFA in PDFs was deprecated a number of years ago. Thus, most PDF libraries don't support XFA forms.

PDFBox only allows you to extract or replace the XFA XML stream. Thus, there is no immediate support for XFA form contents during text extraction.
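
Getting at that XML stream could look like this sketch, again assuming PDFBox 2.x. The class name XfaDumpSketch is made up, and decoding the bytes as UTF-8 is an assumption that may not hold for every file. Locating the field values inside the XML is then up to your own XML processing.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDXFAResource;

/** Sketch: dump the raw XFA XML of a PDF, if it has any. */
final class XfaDumpSketch {

    /** Returns the XFA XML as text, or an empty string if the document has no XFA form. */
    static String xfaXml(PDDocument document) throws IOException {
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return "";
        }
        PDXFAResource xfa = acroForm.getXFA();
        if (xfa == null) {
            return "";
        }
        // Assumption: the XML is UTF-8 encoded; check the XML declaration if in doubt.
        return new String(xfa.getBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            System.out.println(xfaXml(document));
        }
    }
}
```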

score 0 · Answer 2 · answered Dec 08 '21 at 04:51

Form field extraction while maintaining proper alignment.

This is solved by calling setSortByPosition(true) on the PDFTextStripper.

Sahib Yar