I have a Java program that uses PDFBox 1.7.1 and is built with maven-shade-plugin 2.0.
Here is the code that uses the PDFBox API:
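import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class PdfFile {
    protected PDDocument document = null;

    // Parses the PDF from an in-memory byte array.
    public boolean load(byte[] bytes) throws IOException {
        InputStream is = new ByteArrayInputStream(bytes);
        PDFParser parser = new PDFParser(is);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        this.document = new PDDocument(cosDoc);
        return true;
    }

    // Returns the extracted text as bytes.
    public byte[] extractText() throws IOException {
        PDFTextStripper pdfStripper = new PDFTextStripper();
        byte[] text = pdfStripper.getText(this.document).getBytes();
        return text;
    }

    public void close() throws IOException {
        if (this.document != null) {
            this.document.close();
        }
    }
}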
So basically, load() loads a PDF document from a byte array, and extractText() returns the text extracted from the PDF as a byte array. It works when I run the program with the NetBeans Run button, but when I run it from the single jar built with maven-shade-plugin, the returned text is in the wrong character encoding. For example, the word:
odpowiadająca (normal Polish characters)
odpowiadajšca (NetBeans run)
odpowiadajÄca (single shade jar)
I know it's exactly the same file (byte array) that is passed to PdfFile.load() in both runs, so the problem seems to be PDFBox returning text in two different encodings...
I have 3 questions:
- Why is the encoding different in the jar built with the shade plugin?
- How can I control/set the encoding used by the jar built with the shade plugin?
- How can I force PDFBox to return text in the correct encoding?
I know that the command-line version of PDFBox has an option to set the encoding:
java -jar {$jar_path} ExtractText -encoding UTF-8
But I can't find it in the PDFBox API...
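Solved: String.getBytes() with no argument encodes using the JVM's default charset, which evidently differed between the NetBeans run and the shaded jar. I had to change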
pdfStripper.getText(this.document).getBytes();
to
pdfStripper.getText(this.document).getBytes("UTF8");
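For anyone hitting the same thing, here is a minimal sketch of why the explicit charset matters. It does not use PDFBox at all; the class CharsetDemo and the helpers toUtf8 and decode are hypothetical names made up for this illustration. It encodes the sample word as UTF-8 and then decodes those bytes as ISO-8859-1, which yields the same kind of "Ä" garbling shown above. Which charsets were actually in play in the two runs is an assumption on my part.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
class CharsetDemo {

    // Encode with an explicit charset so the result does not depend on
    // the JVM's default charset.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode bytes with a caller-chosen charset; decoding with the wrong
    // one is what produces garbled characters.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+0105 is the Polish "a with ogonek"; an escape is used so the
        // source file's own encoding cannot affect the demo.
        String word = "odpowiadaj\u0105ca";
        byte[] utf8 = toUtf8(word);

        // The default charset is what getBytes() with no argument would use.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Decoding with the same charset restores the original text.
        System.out.println("round trip ok: "
                + word.equals(decode(utf8, StandardCharsets.UTF_8)));

        // Decoding UTF-8 bytes as ISO-8859-1 turns the two-byte sequence
        // for U+0105 into U+00C4 (A with diaeresis) plus a control character.
        String garbled = decode(utf8, StandardCharsets.ISO_8859_1);
        System.out.println("garbled starts with odpowiadaj + U+00C4: "
                + garbled.startsWith("odpowiadaj\u00C4"));
    }
}
```

On Java 7 and later, getBytes(StandardCharsets.UTF_8) can be used in extractText() instead of getBytes("UTF8"); it behaves the same but avoids the checked UnsupportedEncodingException. Whatever consumes the returned bytes still has to decode them as UTF-8.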