I need to read strings from a PDF file and replace them with Unicode text. With ASCII characters everything works, but Unicode characters show up as question marks or junk text. The font file (TTF) is not the problem: I can write Unicode text to the PDF with a different class (PDPageContentStream). That class, however, offers no way to replace text; it can only add new text.

Sample Unicode text:

Bɐɑɒ

The issue (see the Address column):

https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing

I am using PDFBox. This is the code I am using:

    public static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
        throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }

    for (PDPage page : document.getPages()) {

        PDResources resources = new PDResources();
        PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
        //PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
        resources.add(font);
        //resources.add(font2);
        //resources.add(PDType1Font.TIMES_ROMAN);
        page.setResources(resources);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List tokens = parser.getTokens();

        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;

                String pstring = "";
                int prej = 0;

                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();

                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }

                    if (searchString.equals(pstring.trim())) {
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());

                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }

        // now that the tokens are updated we will replace the page content stream.
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
    }

    return document;
}
  • Are you aware that text extraction works with the PDFTextStripper class? What you are doing is working directly on the content stream. About 50% of the PDFBox code is about making real text out of that. – Tilman Hausherr May 27 '18 at 07:28

1 Answer

Your code utterly breaks the PDF, cf. the Adobe Preflight output:

Preflight output

The cause is obvious: your code

    PDResources resources = new PDResources();
    PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
    resources.add(font);
    page.setResources(resources);

drops the pre-existing page resources, and your replacement contains only a single font, the name of which you allow PDFBox to choose arbitrarily.

You must not drop existing resources as they are used in your document.


Inspecting the content of your PDF page, it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 is either a single-byte encoding or a mixed single/multi-byte encoding; the lower single-byte values appear to be encoded ASCII-like.

I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary, your task

to read the strings from PDF file and replace it with the Unicode text

cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.
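
To illustrate why arbitrary code points do not fit, here is a minimal sketch using only the Java standard library. It assumes ISO-8859-1 as a stand-in for the font's actual single-byte encoding (the real encoding in the PDF may differ), and the class and method names are made up for this example. It shows one likely source of the question marks: encoding a string into a charset that has no mapping for a character silently substitutes `?`.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class SingleByteEncodingDemo {

    // True if every character of the text has a code in the single-byte charset.
    // ISO-8859-1 is only a stand-in (assumption) for the font's real encoding.
    public static boolean fitsSingleByte(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        return encoder.canEncode(text);
    }

    // Encodes and decodes again; unmappable characters come back as '?'.
    public static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sample = "B\u0250\u0251\u0252"; // the sample text from the question
        System.out.println(fitsSingleByte("Address")); // plain ASCII fits
        System.out.println(fitsSingleByte(sample));    // the IPA letters do not
        System.out.println(roundTrip(sample));         // prints B???
    }
}
```

Note that the code in the question calls `getBytes()` without a charset, so it uses the platform default; whatever bytes result are then interpreted through the encoding of the font already selected in the content stream, not through the newly loaded font.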


What you can implement instead is:

  • First, run your source PDF through a customized text stripper which, instead of extracting the plain text, searches for the strings to replace and returns their positions. There are numerous questions and answers here that show how to determine the coordinates of strings in text stripper subclasses, a recent one being this one.
  • Next, remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resources, obviously) that replaces the strings with equally long strings of spaces might work, even if it is a dirty hack.
  • Finally, add your replacements at the determined positions using a PDPageContentStream in append mode; for this, add your new font to the existing resources.
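
The string handling of the second step can be sketched without PDFBox. The helper below is hypothetical (the class name, method name, and the `{addr}` placeholder are made up for illustration); it overwrites the first literal match with the same number of spaces, so the string keeps its length. It assumes the operand has already been decoded to a Java string and that the font uses an ASCII-like single-byte encoding, so that one character corresponds to one byte in the content stream.

```java
public class BlankOutDemo {

    // Replaces the first literal occurrence of 'search' in 'shown' with
    // equally many spaces; returns 'shown' unchanged if there is no match.
    public static String blankOut(String shown, String search) {
        int start = shown.indexOf(search);
        if (search.isEmpty() || start < 0) {
            return shown;
        }
        StringBuilder result = new StringBuilder(shown);
        for (int i = start; i < start + search.length(); i++) {
            result.setCharAt(i, ' ');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // "{addr}" is a made-up placeholder, not taken from the question's PDF.
        String blanked = blankOut("Address: {addr}", "{addr}");
        System.out.println("[" + blanked + "]");
    }
}
```

Unlike `String.replaceFirst`, which the code in the question uses, `indexOf` matches the search string literally, so characters such as `{` or `.` are not interpreted as regular expression syntax.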

Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.

mkl
  • Hi mkl, can you suggest an approach for doing this? Generating the PDF dynamically will be difficult, as we need to play with coordinates... – Aneesh Reddy Jun 05 '18 at 07:37
  • Have a look at the *"What you can implement instead is:"* part of my answer: You first determine the coordinates using a customized `PDFTextStripper` which instead of extracting the plain text searches for your strings to replace and returns their positions. There are numerous Q&A here on the topic of extracting coordinates in a `PDFTextStripper` derived class. – mkl Jun 05 '18 at 07:54