I am trying to read a document using iText and replace a string in it, but after the manipulation all Spanish characters become junk characters. Below is the code that changes the PDF.

    PdfReader     reader = new PdfReader(src);
    PdfDictionary dict   = reader.getPageN(1);
    PdfObject     object = dict.getDirectObject(PdfName.CONTENTS);
    if (object instanceof PRStream) {
        PRStream stream     = (PRStream) object;
        byte[]   data       = PdfReader.getStreamBytes(stream);
        String   dataString = new String(data);
        dataString = dataString.replace(sourceString, replacementString);
        stream.setData(dataString.getBytes("UTF-8"));
    }
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
    stamper.close();
    reader.close();

The actual PDF contains the string ${address-line-one}, which I am replacing with "20th Street".

This works, but the following Spanish word, which is also in the stream, gets corrupted:

Documentación becomes Documentaci�n

The same happens with other Spanish words.

I also printed the byte[] to the Java console and found that the character is already wrong at the point where the bytes are read.

Any suggestions?

SaChi

1 Answer

You use

new String(data)

to turn the bytes into a string (using some default encoding) and

dataString.getBytes("UTF-8")

to turn the string back into bytes (using utf-8).

Thus, if the default encoding in the first operation does not match utf-8, these transformations will create artefacts as you see above.
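To illustrate the mismatch, here is a minimal sketch that decodes a few bytes with one charset and re-encodes them as UTF-8, the way the question's code effectively does. The class and method names are made up for this illustration, and the sample bytes assume the text in the stream is encoded like Latin-1, where 0xF3 is "ó".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of iText.
class EncodingMismatchDemo {

    // Decodes the bytes with the given charset and re-encodes the
    // resulting string as UTF-8, mirroring new String(data) followed
    // by getBytes("UTF-8") in the question.
    static byte[] decodeThenEncodeUtf8(byte[] data, Charset decodeWith) {
        String text = new String(data, decodeWith);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ción" as it might appear in a content stream (assumption:
        // single-byte encoding, 0xF3 = 'ó').
        byte[] original = {'c', 'i', (byte) 0xF3, 'n'};

        byte[] viaUtf8 = decodeThenEncodeUtf8(original, StandardCharsets.UTF_8);
        byte[] viaLatin1 = decodeThenEncodeUtf8(original, StandardCharsets.ISO_8859_1);

        System.out.println("original:           " + Arrays.toString(original));
        System.out.println("decoded as UTF-8:   " + Arrays.toString(viaUtf8));
        System.out.println("decoded as Latin-1: " + Arrays.toString(viaLatin1));
    }
}
```

In both cases the bytes written back differ from the bytes read, so the PDF viewer no longer finds the single byte it expects for "ó".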

So please use

new String(data, encoding)

and

dataString.getBytes(encoding)

instead.


That being said, UTF-8 is a very inappropriate encoding here; use something along the lines of Latin-1 / ISO-8859-1 instead.
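As a rough sketch of why a single-byte encoding fits here: ISO-8859-1 maps every byte value to exactly one character and back, so bytes that are not touched by the replacement come out unchanged. The helper name replaceInStream and the sample content are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of iText.
class Latin1Replace {

    // Decodes the stream bytes as ISO-8859-1, replaces the placeholder,
    // and encodes the result with the same charset.
    static byte[] replaceInStream(byte[] data, String source, String replacement) {
        String text = new String(data, StandardCharsets.ISO_8859_1);
        return text.replace(source, replacement)
                   .getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Simplified stand-in for a content stream fragment.
        byte[] data = "(Documentaci\u00f3n ${address-line-one}) Tj"
                .getBytes(StandardCharsets.ISO_8859_1);

        byte[] changed = replaceInStream(data, "${address-line-one}", "20th Street");

        // Decode again only for display purposes.
        System.out.println(new String(changed, StandardCharsets.ISO_8859_1));
    }
}
```

In the question's code, the returned array would be what is passed to stream.setData. Note that any character in the replacement that ISO-8859-1 cannot represent is silently replaced by getBytes.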


That being said, your approach to editing the content will only work in very specific PDFs. In particular, the encodings of the fonts used must be WinAnsiEncoding, and lines or "fields" must each be drawn in a single instruction. Furthermore, your replacements must not be much longer than the replaced text and must not contain characters for which Latin-1 and WinAnsiEncoding differ or which have special meanings in PDFs, and you must make sure that you do not accidentally change instructions outside the strings.
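Two of these constraints on the replacement text can be checked or handled mechanically. The following is only a sketch with hypothetical helper names: it treats the range 0x80 to 0x9F as the region where Latin-1 and WinAnsiEncoding disagree (a rough approximation), and it escapes the backslash and parentheses, which have special meaning inside PDF literal strings.

```java
// Hypothetical helper class, not part of iText.
class ReplacementChecks {

    // Rough check: accepts only characters that are plain ASCII or lie in
    // 0xA0..0xFF, i.e. avoids 0x80..0x9F and everything above 0xFF.
    static boolean isSafeForLatin1AndWinAnsi(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean ascii = c < 0x80;
            boolean upperLatin1 = c >= 0xA0 && c <= 0xFF;
            if (!ascii && !upperLatin1) {
                return false;
            }
        }
        return true;
    }

    // Escapes backslash and parentheses for use inside a PDF literal string.
    // Balanced parentheses would not strictly need escaping, but escaping
    // all of them is the simpler, safe choice.
    static String escapePdfLiteral(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '(' || c == ')') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String replacement = "20th Street (rear)";
        System.out.println(isSafeForLatin1AndWinAnsi(replacement));
        System.out.println(escapePdfLiteral(replacement));
    }
}
```

This does not address the other caveats, such as text split across several drawing instructions or fonts with a different encoding.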

mkl
  • Can't do +1 as my reputation is not that high, but this worked for me: String dataString = new String(data, "ISO-8859-1"); dataString = dataString.replace(sourceString, replacementString); stream.setData(dataString.getBytes("ISO-8859-1")); – SaChi Nov 02 '17 at 13:25
  • Good. Beware of the warnings further down in the answer, though: if the producer of your PDFs ever changes, your code might suddenly stop working. By the way, you indeed cannot *upvote*, but you can *accept* an answer: simply click the tick at its upper left, right below the voting arrows. – mkl Nov 02 '17 at 14:21
  • I am creating the PDF with iText myself and then changing text somewhere down the line, so I am both the producer and the consumer in this case :) Thanks, though. – SaChi Nov 02 '17 at 16:01