Getting illegal character after replacing a string through pdfbox

Question

I followed this answer and wrote the program below. In a PDF, I'm trying to replace the word datalog with ddddddd. All occurrences were replaced successfully. The problem is that in some places where "- " was present, it got replaced by the illegal character Å’. The word datalog appears on pages 3 and 5, but the illegal character shows up on page 4. I'd like to know why I got that character. Any help would be highly appreciated.

import java.io.*;
import java.util.*;

import org.apache.pdfbox.cos.COSArray;
import org.apache.pdfbox.cos.COSString;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageTree;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDTrueTypeFont;

public class SimpleReplace {

    public static void main (String[] args) throws Exception {

        PDDocument document = null;
        String fileName ="";
        try {
            document = PDDocument.load( new File(fileName),"" );
            document.setAllSecurityToBeRemoved(true);

            String outputFileName = "SimpleReplace.pdf";
            // the encoding will need to be adapted to your circumstances
            //String encoding = "ISO-8859-1";
            String encoding = "ISO-8859-1";

            // Note that the search and replace strings are treated as regular expressions
            // replace all occurrences of the search string (replaceAll = true)
            searchReplace("    ", "Aaaa Aaaaa Aaa", encoding, true, document);
            // Save the results and ensure that the document is properly closed
            document.save(outputFileName);
        }
        finally {
            if( document != null ) {
                document.close();
            }
        }

    }

    private static void searchReplace (String search, String replace,String encoding ,boolean replaceAll, PDDocument doc) throws IOException {
        PDPageTree pages = doc.getDocumentCatalog().getPages();
        for (PDPage page : pages) {
            int count = 0;
            PDFStreamParser parser = new PDFStreamParser(page);
            parser.parse();
            List<Object> tokens = parser.getTokens();
            for (int j = 0; j < tokens.size(); j++) {
                Object next = tokens.get(j);
                if (next instanceof Operator) {
                    Operator op = (Operator) next;
                    // Tj and TJ are the two operators that display strings in a PDF
                    // Tj takes one operand, the string to display, so let's update that operand
                    if (op.getName().equals("Tj")) {
                        COSString previous = (COSString) tokens.get(j-1);
                        String string = previous.getString();
                        if (replaceAll) {
                            string = string.replaceAll(search, replace);
                        } else {
                            string = string.replaceFirst(search, replace);
                        }
                        // getBytes() uses the platform default charset; the 'encoding' parameter is never applied
                        previous.setValue(string.getBytes());
                    } else if (op.getName().equals("TJ")) {
                        COSArray previous = (COSArray) tokens.get(j-1);
                        for (int k = 0; k < previous.size(); k++) {
                            Object arrElement = previous.getObject(k);
                            if (arrElement instanceof COSString) {
                                COSString cosString = (COSString) arrElement;
                                String string = cosString.getString();
                                if (replaceAll)
                                    string = string.replaceAll(search, replace);
                                else
                                    string = string.replaceFirst(search, replace);
                                cosString.setValue(string.getBytes());
                            }
                        }
                    }
                }
            }
            // now that the tokens are updated we will replace the page content stream.
            PDStream updatedStream = new PDStream(doc);
            OutputStream out = updatedStream.createOutputStream();
            ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
            tokenWriter.writeTokens(tokens);
            out.close();
            page.setContents(updatedStream);
        }
    }
}
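
Editorial note: both calls to `string.getBytes()` above use the platform default charset, and the `encoding` parameter is never used. The sketch below shows one plausible way a dash could turn into "Å’" under these conditions. It rests on assumptions that cannot be confirmed without the PDF: that the affected font uses a WinAnsi-like encoding (simulated here with the JDK's `windows-1252` charset), that the original dash was the single byte 0x96, that `COSString.getString()` decoded that byte as "Œ" (U+0152), and that the default charset was UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DashRoundTripDemo {
    // Stand-in for a WinAnsi-like font encoding (assumption about the PDF's font).
    static final Charset WIN_ANSI = Charset.forName("windows-1252");

    // What a viewer would display for these bytes with such a font.
    static String shownByWinAnsiFont(byte[] bytes) {
        return new String(bytes, WIN_ANSI);
    }

    // Mirrors string.getBytes() on a platform whose default charset is UTF-8 (assumption).
    static byte[] rewriteLikeQuestion(String decoded) {
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Byte 0x96 is an en dash in windows-1252.
        byte[] original = { (byte) 0x96 };
        System.out.println("original bytes:  " + hex(original));

        // Assumption: the string decoding step yields U+0152 for byte 0x96.
        String decoded = "\u0152";

        // Writing the string back as UTF-8 produces two bytes instead of one.
        byte[] rewritten = rewriteLikeQuestion(decoded);
        System.out.println("rewritten bytes: " + hex(rewritten));

        // A WinAnsi-like font now shows two unrelated glyphs.
        System.out.println("displayed as:    " + shownByWinAnsiFont(rewritten));
    }
}
```

If this is what happened, the character appears on pages that contain no match at all, because the code rewrites the bytes of every string operand, not only the ones in which the search term was found.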

It would be better if you stated the reason behind the downvote! I'm just asking what the reason behind that illegal character would be. — saiKrishna kingdom, Oct 24 '19 at 08:22
I didn't downvote, and I agree with you that one should provide a reason. However, read this: https://pdfbox.apache.org/2.0/migration.html The problem you got is likely related to that. Consider linking to the PDF (before and after) so we can have a look. (You can also look at it with PDFDebugger yourself.) — Tilman Hausherr, Oct 24 '19 at 10:50
I haven't downvoted yet either (most likely those people have meanwhile moved on to other questions for good), but I really feel tempted to. After all, the comments to the answer you reference clearly warn you that the code therein only works occasionally... — mkl, Oct 24 '19 at 22:31
@TilmanHausherr I'm not sure whether the problem is with my compiler or with my code. My code isn't working for one case. Could you please test it if I share my document and code, if possible? It's only for one case. — saiKrishna kingdom, Oct 25 '19 at 12:17
Yes, you're expected to put your code into the question (you can edit it). Attach the PDFs by including a link to them (the original and the result PDF). — Tilman Hausherr, Oct 25 '19 at 12:40
As mentioned in comments to the answer from which you took the code: *This code only works in very simple PDFs and doesn't change or (even worse) damages more complex ones.* Your PDF is not simple enough. — mkl, Oct 25 '19 at 14:17
Is there any way I can work with complex PDFs? I've downloaded the PDFBox debugger. Is there any link explaining how to work with it? — saiKrishna kingdom, Oct 25 '19 at 15:18
There is no documentation, just click on everything and see what happens. Have a look at the page content stream and the PDF specification. A good start there is the "operator summary". I still wonder what happened, because if "d" existed before, "dddddd" shouldn't cause a problem. Unless "ddddd" here is a placeholder for a different word, e.g. a company name containing a character that didn't exist in the font. — Tilman Hausherr, Oct 25 '19 at 15:56
To give you a hint about the complications: the different fonts in your PDF use ad-hoc encodings, e.g. the encoding of the font used for the vertical "ESIC-RSANDHYA-..." on the left does not contain a 'J', so in the content stream it is "EIRC-QRAMDHXA-...". Other fonts have different ad-hoc encodings. Thus, to replace like you do, you must take the encoding in question into account, and you must make sure that your replacement string does not contain characters missing in the encoding at hand. And "Meghana" in the content stream looks like "M^`aYfY"... — mkl, Oct 25 '19 at 17:20
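
Editorial note: the check mkl describes can be sketched as follows. This is a minimal illustration and not PDFBox code. The class, its methods, and the map-based encoding are hypothetical, and the mapping is only the fragment that can be read off the "Meghana" / "M^`aYfY" example in the comment above. A real implementation would have to obtain each font's actual encoding from the PDF.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class AdHocEncodingDemo {
    // Partial ad-hoc encoding reconstructed from the "Meghana" example:
    // readable character -> character code as it appears in the content stream.
    static Map<Character, Character> sampleEncoding() {
        Map<Character, Character> enc = new LinkedHashMap<>();
        enc.put('M', 'M');
        enc.put('e', '^');
        enc.put('g', '`');
        enc.put('h', 'a');
        enc.put('a', 'Y');
        enc.put('n', 'f');
        return enc;
    }

    // True only if every character of the text has a code in this font's encoding.
    static boolean canEncode(String text, Map<Character, Character> enc) {
        for (char c : text.toCharArray()) {
            if (!enc.containsKey(c)) {
                return false;
            }
        }
        return true;
    }

    // Translates readable text into the codes the content stream would need.
    static String encode(String text, Map<Character, Character> enc) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            Character code = enc.get(c);
            if (code == null) {
                throw new IllegalArgumentException("Character not in font encoding: " + c);
            }
            sb.append(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Character, Character> enc = sampleEncoding();
        System.out.println(encode("Meghana", enc));
        System.out.println(canEncode("ddddddd", enc));
    }
}
```

With this partial mapping, "Meghana" encodes to the string seen in the content stream, while "ddddddd" is rejected because 'd' has no code in the font. A replacement should only be written when such a check passes for the font active at that point in the content stream.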
