In my Java web application, when I upload a ZIP file (a thread dump), I receive it as an InputStream
in a servlet. I use the Zip4j library to unzip the file and then write its contents to a single output file. The ZIP file contains content in multiple encodings (UTF-8, windows-1252, ISO-8859-1, ISO-8859-2, IBM424_rtl). When I open the output file, I see garbled characters like this: Mac OS X 2 € ² ATTR ² ˜
Here is some sample code. Could you please let me know how I can fix this issue?
// Using Zip4j library to uncompress ZIP format
ZipInputStream zis = new ZipInputStream(iStream);
FileOutputStream zos = new FileOutputStream("output_file.txt");
ByteArrayOutputStream out = new ByteArrayOutputStream();
LocalFileHeader localFileHeader = zis.getNextEntry();
while (localFileHeader != null) {
    if (localFileHeader.isDirectory()) {
        localFileHeader = zis.getNextEntry();
        continue;
    }
    IOUtils.copy(zis, out);
    localFileHeader = zis.getNextEntry();
}
InputStreamReader isr = new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
BufferedReader reader = new BufferedReader(isr);
String str;
while ((str = reader.readLine()) != null) {
    // This is a custom method that returns the charset of the input string using the Apache Tika library
    String encoding = CharsetDetector.detectCharset(str);
    zos.write(str.getBytes(encoding));
    zos.write("\n".getBytes());
}
isr.close();
reader.close();
zos.close();
zis.close();
// Method is used to detect charset
public static String detectCharset(String text) throws IOException {
    org.apache.tika.parser.txt.CharsetDetector detector = new org.apache.tika.parser.txt.CharsetDetector();
    detector.setText(text.getBytes());
    String charset = detector.detect().getName();
    return charset;
}
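For comparison, here is a minimal sketch of copying each entry byte for byte, so that no decoding or re-encoding happens at all. It is an illustration under stated assumptions, not a confirmed fix. It uses the JDK's `java.util.zip.ZipInputStream` instead of Zip4j so that it is self-contained. The class `RawZipCopy` and the helper `isMacMetadata` are hypothetical names. The filter assumes that the "Mac OS X ... ATTR" text comes from AppleDouble metadata entries (`__MACOSX/` folders and `._` files) that macOS adds to archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class RawZipCopy {

    // Hypothetical filter (assumption): macOS archives carry AppleDouble metadata
    // under "__MACOSX/" and in files whose name starts with "._".
    static boolean isMacMetadata(String entryName) {
        String baseName = entryName.substring(entryName.lastIndexOf('/') + 1);
        return entryName.startsWith("__MACOSX/") || baseName.startsWith("._");
    }

    // Copies every regular entry to 'target' as raw bytes; nothing is decoded,
    // so the original encoding of each entry is preserved.
    static void copyEntries(InputStream zipData, OutputStream target) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(zipData)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory() || isMacMetadata(entry.getName())) {
                    continue; // skip directories and macOS metadata entries
                }
                int read;
                while ((read = zis.read(buffer)) != -1) {
                    target.write(buffer, 0, read);
                }
            }
        }
    }
}
```

Calling `RawZipCopy.copyEntries(iStream, new FileOutputStream("output_file.txt"))` would write the entries unchanged. If the output must end up in one specific encoding, charset detection would need to run on each entry's raw bytes rather than on an already decoded `String`.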
Note: I am running the application on a Windows machine.
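The platform matters because `new InputStreamReader(stream)` and `String.getBytes()` without a charset argument use the JVM's default charset. On Windows that was typically windows-1252 before Java 18, and it is UTF-8 from Java 18 onwards. The sketch below is only an illustration: `RoundTripDemo` and `roundTrip` are hypothetical names, and the charsets are passed explicitly so the result does not depend on the machine. It shows that decoding bytes with the wrong charset and encoding them again does not always give back the original bytes.

```java
import java.nio.charset.Charset;

class RoundTripDemo {

    // Decodes 'original' with the 'assumed' charset and encodes the resulting
    // String again with the same charset. This mimics reading bytes through an
    // InputStreamReader and writing them back with String.getBytes(...).
    static byte[] roundTrip(byte[] original, Charset assumed) {
        String decoded = new String(original, assumed);
        return decoded.getBytes(assumed);
    }
}
```

For example, the single byte `0xE9` ("é" in ISO-8859-1) is not valid UTF-8, so `roundTrip(new byte[] {(byte) 0xE9}, StandardCharsets.UTF_8)` returns the three bytes `EF BF BD` (the replacement character), while the same call with `StandardCharsets.ISO_8859_1` returns the input unchanged.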
Thanks in advance!