PDFBox 2.0 RC3 -- Find and replace text

Question

How can one find and replace text inside a PDF document using PDFBox 2.0? The old example was pulled and its syntax no longer works, so I am wondering whether this is still possible and, if so, what the best way to go about it is. Thanks!

That old example actually only worked in very simple PDFs and didn't change or (even worse) damaged more complex ones. — mkl, Feb 15 '16 at 23:08
https://github.com/chadilukito/Apache-PdfBox-2-Examples/blob/master/ReplaceText.java — Hrvoje, May 20 '21 at 11:27

score 11 · Answer 1 · answered Apr 04 '16 at 13:42

You can try something like this:

import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import org.apache.commons.lang3.StringUtils;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.common.PDStream;

public static PDDocument replaceText(PDDocument document, String searchString, String replacement) throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }
    PDPageTree pages = document.getDocumentCatalog().getPages();
    for (PDPage page : pages) {
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List<Object> tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    // literal replacement (String.replaceFirst would treat searchString as a regex)
                    string = StringUtils.replaceOnce(string, searchString, replacement);
                    // NOTE: getBytes() uses the platform default charset
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    // TJ takes an array of strings and positioning numbers
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            string = StringUtils.replaceOnce(string, searchString, replacement);
                            cosString.setValue(string.getBytes());
                        }
                    }
                }
            }
        }
        // now that the tokens are updated, replace the page content stream
        PDStream updatedStream = new PDStream(document);
        try (OutputStream out = updatedStream.createOutputStream()) {
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
        }
        page.setContents(updatedStream);
    }
    return document;
}
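
One detail worth checking in any helper of this kind is whether the search string is matched literally or as a regular expression. `String.replaceFirst` interprets its first argument as a regex, so a search string containing characters such as parentheses or dots may match something other than what was typed. The following is a minimal sketch using only the JDK; the class name `LiteralReplaceDemo` and the helper `replaceOnceLiteral` are made up for illustration and are not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiteralReplaceDemo {
    // Replace the first occurrence of search, treating both arguments as plain text.
    static String replaceOnceLiteral(String text, String search, String replacement) {
        return text.replaceFirst(Pattern.quote(search), Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "Total (net): 10.00";
        // Regex semantics: "(net)" is a group matching "net", so the parentheses stay in place.
        System.out.println(text.replaceFirst("(net)", "gross"));
        // Literal semantics: the whole "(net)" including the parentheses is replaced.
        System.out.println(replaceOnceLiteral(text, "(net)", "gross"));
    }
}
```

Quoting both the pattern and the replacement keeps `$` and `\` in the replacement text from being interpreted as group references.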

answered Apr 04 '16 at 13:42

mourphy

This code only works in very simple PDFs and doesn't change or (even worse) damages more complex ones. – mkl Apr 05 '16 at 06:36
https://pdfbox.apache.org/2.0/migration.html Why was the ReplaceText example removed? – Tilman Hausherr Apr 05 '16 at 06:46
This is explained at the very last section of the link you mentioned: https://pdfbox.apache.org/2.0/migration.html#why-was-the-replacetext-example-removed It's mainly due to character encoding and font issues. – maxxyme Nov 21 '16 at 17:42
@mkl What is your approach for replacing text in pdf files? Thanks. – The_Cute_Hedgehog Mar 27 '20 at 19:02
@The_Cute_Hedgehog *"What is your approach for replacing text in pdf files"* - optimally you don't. That's not what pdf is designed for. If you have to nonetheless, it depends on what you know about the documents with the text to replace. If you know that they are (and will remain) simple enough for the code in this answer, you do something like that code. If the documents are not that simple but still subject to certain conditions, you can try a combination of redaction and adding of new content. For arbitrary documents, though, you don't try this. – mkl Mar 28 '20 at 09:36
I used this approach for a different sort of change: I had a book that used a font color I did not like, and I managed to replace it by adapting this code. This carries no risk of corrupting the page layout. One note: even when passing a FlateDecode filter to createOutputStream(), I get a larger file, because I am only able to compress the text, not the rest. – user1708042 Jun 21 '20 at 22:01
@mkl I found a Python library, ocrmypdf, that can add a content stream to a scanned PDF. Can we do the same using PDFBox? If yes, please share some sample code. Thanks – fascinating coder Sep 16 '20 at 07:36
I have no idea what that *ocrmypdf* does. If you only want to add additional content streams to a page, that's trivial with PDFBox, simply create a `PDPageContentStream` for the page with `AppendMode.APPEND`. – mkl Sep 16 '20 at 08:45
I will create new question on this. I will explain everything. – fascinating coder Sep 18 '20 at 12:09
@mkl Hi here it is. https://stackoverflow.com/questions/63964552/while-creating-the-text-layer-for-scanned-pdf-edit-the-text-without-messing-up-t – fascinating coder Sep 21 '20 at 01:12
For some reason the example doesn't work for me (I am using PDFBox 2.0.21). I tried to use the function like this: `PDDocument pdfDocument = PDDocument.load(new File("test.pdf")); pdfDocument = replaceText(pdfDocument, "test", "test2"); pdfDocument.save("test_out.pdf"); pdfDocument.close();` – Nathan B Nov 17 '20 at 14:19
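
One common way a per-string replacement like this misses text (and possibly, though not certainly, what happens in the comment above) is that a PDF producer may split a visible word across several string fragments of one TJ array, with positioning numbers in between. Each fragment is then searched on its own and none of them contains the whole search string. The sketch below simulates this with plain Java lists instead of PDFBox objects; `SplitTextDemo`, `replacePerElement` and `visibleText` are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitTextDemo {
    // Replace inside each string element on its own, the way a loop over a TJ array would.
    static List<Object> replacePerElement(List<Object> tjArray, String search, String replacement) {
        List<Object> result = new ArrayList<>();
        for (Object element : tjArray) {
            if (element instanceof String) {
                result.add(((String) element).replace(search, replacement));
            } else {
                result.add(element); // numbers stand in for positioning adjustments
            }
        }
        return result;
    }

    // Join only the string elements: roughly the text a reader sees on the page.
    static String visibleText(List<Object> tjArray) {
        StringBuilder sb = new StringBuilder();
        for (Object element : tjArray) {
            if (element instanceof String) {
                sb.append(element);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Object> whole = Arrays.asList("test");
        List<Object> split = Arrays.asList("te", -15, "st");
        // The unsplit word is found and replaced.
        System.out.println(visibleText(replacePerElement(whole, "test", "test2")));
        // The split word looks identical on the page but no fragment contains "test".
        System.out.println(visibleText(replacePerElement(split, "test", "test2")));
    }
}
```

Font encoding adds a second, independent obstacle: the bytes in a content stream string are character codes for the current font, so for fonts with a custom or subset encoding they need not correspond to the characters of the search string at all.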

score 5 · Answer 2 · answered Nov 03 '17 at 05:17

I spent a lot of time trying to come up with a solution for this and ended up getting an Acrobat DC subscription so that I could create form fields as placeholders for the text to be replaced. In my case these fields held customer information and order details, so the data was not very complex, but the document was filled with pages of business-related conditions and had a very complex layout.

Then I simply did this, which may be suitable for you.

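import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;

// InvalidPasswordException is a subclass of IOException, so one throws clause covers both
private void update() throws IOException {
    Map<String, String> map = new HashMap<>();
    map.put("fieldname", "value to update");
    File template = new File("template.pdf");
    // try-with-resources closes the document even if filling or saving fails
    try (PDDocument document = PDDocument.load(template)) {
        PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
        if (acroForm == null) {
            return; // the template has no form fields
        }
        // getFieldTree() also visits nested fields; getFields() returns only the top level
        for (PDField field : acroForm.getFieldTree()) {
            String value = map.get(field.getFullyQualifiedName());
            if (value != null) {
                field.setValue(value);
                field.setReadOnly(true);
            }
        }
        document.save(new File("out.pdf"));
    }
}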

YMMV

Using AcroForm fields indeed is how PDF fill-ins should be done. But you do not need Acrobat to create fields, you can do that with PDFBox, too... (without the nice GUI, though.) — mkl, Nov 03 '17 at 08:10
Thx @mkl, I did realise that the fields can be created using pdfbox, but I could not work out how to place them in document exactly where they needed to be. — Tim Coy, Nov 08 '17 at 21:04
Any safe way to replace text in each footer / header? If I put a field in the footer it will only be one field and not repeat (shows only 1 field `pdftk the.pdf dump_data_fields`). — jcalfee314, Feb 23 '23 at 17:08
