Result of program using pdfbox built with maven-shade-plugin is different than normal NetBeans Run

Question

I have a Java program that uses PDFBox 1.7.1 and is built with maven-shade-plugin 2.0.

Here is the code that uses the PDFBox API:

public class PdfFile {

    protected PDDocument document = null;

    public boolean load(byte[] bytes) throws IOException {
        InputStream is = new ByteArrayInputStream(bytes);
        PDFParser parser = new PDFParser(is);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        this.document = new PDDocument(cosDoc);
        return true;
    }

    public byte[] extractText() throws IOException {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        byte[] text = pdfStripper.getText(this.document).getBytes();

        return text;
    }

    public void close() throws IOException {
        if(this.document != null) {
            this.document.close();
        }
    }
}

So basically, load() loads a PDF document from a byte array and extractText() returns the text extracted from the PDF as a byte array. It works when I run the program with the NetBeans Run button, but when I run it from the single jar built with maven-shade-plugin, the returned text is in the wrong character encoding. For example, the word:

odpowiadająca (correct Polish characters)
odpowiadajšca (netbeans run)
odpowiadajÄca (single shade jar)

I know that exactly the same file (byte array) is passed to PdfFile.load() in both runs, so the problem seems to be PDFBox returning text in two different encodings.

I have 3 questions:

Why is the encoding different in the jar built with the shade plugin?
How can I control or set the encoding used by the jar built with the shade plugin?
How can I force PDFBox to return text in the correct encoding?

I know that the PDFBox command-line tool has an option to set the encoding:

java -jar {$jar_path} ExtractText -encoding UTF-8

But I can't find it in the PDFBox API.

Solved: I had to change

pdfStripper.getText(this.document).getBytes();

to

pdfStripper.getText(this.document).getBytes("UTF8");
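
A minimal sketch of the same fix using the `StandardCharsets.UTF_8` constant (Java 7+) instead of the charset name string, which avoids the checked `UnsupportedEncodingException`. The class `Utf8Bytes` and the helper `toUtf8` are hypothetical names for illustration; in the code above the argument would be the result of `pdfStripper.getText(this.document)`.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {

    // Hypothetical helper: encodes text with an explicit charset so the
    // result does not depend on the platform default encoding.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u0105" is the Polish letter a-with-ogonek, written as an escape
        // so the example does not depend on the source file encoding.
        byte[] bytes = toUtf8("odpowiadaj\u0105ca");

        // The consumer must decode with the same charset.
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("odpowiadaj\u0105ca"));
    }
}
```

Whatever reads the byte array later has to decode it as UTF-8 as well; fixing only the encoding side is not enough if the consumer still relies on the platform default.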

score 2 · Accepted Answer · answered Feb 02 '13 at 15:37

First, here are two facts (about your question 2):

According to this code, the default output encoding is UTF-8.
There is a PDFTextStripper constructor that takes the output encoding as an argument.

For questions 1 and 3:

I think your problem is more related to the way you convert the byte[] returned by extractText() into a String.

new String(byte[]) uses the platform default encoding, and so does String.getBytes() without an argument. Running the program within NetBeans and running it from a shell can therefore give different results, since I expect the platform encoding to be different when running within NetBeans.
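
To illustrate the point, here is a small sketch of what happens when bytes are encoded with one charset and decoded with another. The class and method names (`CharsetMismatchDemo`, `recode`) are made up for this example, and ISO-8859-1 is only an assumed stand-in for "some other platform encoding"; the actual encodings involved in the question are not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Encodes text with one charset and decodes the bytes with another,
    // which is what happens implicitly when two sides of a program rely
    // on different default encodings.
    static String recode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        // "\u0105" is the letter a-with-ogonek from "odpowiadajaca".
        String original = "odpowiadaj\u0105ca";

        // Same charset on both sides: the text survives unchanged.
        String ok = recode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(original));

        // UTF-8 bytes read as ISO-8859-1: the two bytes 0xC4 0x85 become
        // two separate characters, U+00C4 and the control character U+0085.
        String broken = recode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(broken.equals("odpowiadaj\u00C4\u0085ca"));

        // The charset used by getBytes() and new String(byte[]) when
        // no charset is given.
        System.out.println(Charset.defaultCharset());
    }
}
```

The mismatched case has the same shape as the "odpowiadajÄca" output in the question: a single letter turns into Ä followed by a character that is usually not displayed. The default charset printed on the last line commonly differs between an IDE launch and a plain `java -jar` launch; on many JVMs it can be set with the `file.encoding` system property, but passing an explicit charset in the code is the more reliable fix.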

Posting the code that handles the result of your extractText() would be helpful.

Thanks, you were right about the string. In the code above I use `pdfStripper.getText(this.document).getBytes();`, which is `String.getBytes()`. I had to change this line to `pdfStripper.getText(this.document).getBytes("UTF8");` and that solved the problem, thanks! - user606521, Feb 03 '13 at 10:38
