
In my Java web application, when I upload a ZIP file (thread dump), I get an InputStream in the servlet. I use the Zip4j library to unzip the file and then write its contents to an output file. The ZIP file has content in multiple encodings (UTF-8, windows-1252, ISO-8859-1, ISO-8859-2, IBM424_rtl). When I open the output file, I see garbled characters like this: Mac OS X 2 € ² ATTR ² ˜

Here is some sample code. Can you please let me know how I can fix this issue?

// Using Zip4j library to uncompress ZIP format 
ZipInputStream zis = new ZipInputStream(iStream);

FileOutputStream zos = new FileOutputStream("output_file.txt");
ByteArrayOutputStream out = new ByteArrayOutputStream();
        
LocalFileHeader localFileHeader = zis.getNextEntry();
while (localFileHeader != null) {
            
    if(localFileHeader.isDirectory()) {
            
        localFileHeader = zis.getNextEntry();
        continue;
    }
            
    IOUtils.copy(zis, out);
    localFileHeader = zis.getNextEntry();
}
        
InputStreamReader isr = new InputStreamReader(new ByteArrayInputStream(out.toByteArray()));
BufferedReader reader = new BufferedReader(isr);
        
String str;
while ((str = reader.readLine()) != null) {
    
    // This is a custom method that returns the charset of the input string using the Apache Tika library
    String encoding = CharsetDetector.detectCharset(str);
            
    zos.write(str.getBytes(encoding));
    zos.write("\n".getBytes());
}
          
isr.close();
reader.close();
zos.close();
zis.close();

// Method is used to detect charset
public static String detectCharset(String text) throws IOException {
    
    org.apache.tika.parser.txt.CharsetDetector detector = new org.apache.tika.parser.txt.CharsetDetector();
    detector.setText(text.getBytes());
    String charset = detector.detect().getName();
    
    return charset;
}

Note: I am running the application on a Windows machine.

Thanks in advance!

Mahesh
  • A `CharsetDetector` that acts on a `String` is already fundamentally at a loss, because that `String` has quite possibly already lost some data when the original `byte[]` was converted to a `String` with the wrong encoding (presumably wrong, since the correct encoding isn't known before the `CharsetDetector` runs). Any `CharsetDetector` that wants a realistic chance of being good has to take a `byte[]` (or some equivalent type). Also: what do you mean by "multi-encoded"? Is it a text file where various parts can have different encodings? Or some binary format? Or do you simply not know? – Joachim Sauer Sep 15 '21 at 15:17
  • I just saw that you are using Tika to build that `CharsetDetector`. You'll note that the Tika `CharsetDetector` takes in a `byte[]` or an `InputStream` for exactly the reason I mentioned above. By converting your `byte[]` data to `String`s before detecting the encoding you're ruining your chances of doing it right. – Joachim Sauer Sep 15 '21 at 15:19
  • The zip file contains some .txt and .File extension files. When I debugged, the `CharsetDetector` was returning different charsets such as UTF-8, windows-1252, ISO-8859-1, ISO-8859-2, and IBM424_rtl for some lines. – Mahesh Sep 15 '21 at 15:31
  • Just because the detector outputs something doesn't mean that the file is actually in those encodings. Character set detection is a finicky, imprecise science even in the best of cases (i.e. when not mangling the content by attempting a random character encoding first). I've personally not yet found any text files that intentionally used different encodings on different lines. My guess is that you're trying to interpret something as text which is not actually text. In fact the "Mac OS X" and "ATTR" in your output suggest you're stumbling over the infamous `._` files that archives created on Mac contain. – Joachim Sauer Sep 15 '21 at 15:36
  • Thanks for your advice. So you are saying that instead of converting `byte[]` to `String` I should pass the `byte[]` or `InputStream` to `CharsetDetector`. Am I correct? – Mahesh Sep 15 '21 at 15:37
  • That is one important step to improve your `CharsetDetector`, yes. But I don't think this will fundamentally fix your problem, because the file you are trying to interpret as text very likely **simply is not a text file** (so there is no "correct encoding" that you could detect on it). [See this question for an explanation of what that file likely is](https://apple.stackexchange.com/questions/14980/why-are-dot-underscore-files-created-and-how-can-i-avoid-them). – Joachim Sauer Sep 15 '21 at 15:38
  • This zip file has Java thread dumps and top output files, and I think it was captured on macOS. Some files have names like `._jstack.30131.132258.550327868`. – Mahesh Sep 15 '21 at 15:42
  • There's a 99% chance that any actual text file in that archive is simply UTF-8 encoded, since that's the default on Mac OS X (a very sensible choice, I might add). Any file that fails to decode as UTF-8 is likely to not be a text file at all. – Joachim Sauer Sep 15 '21 at 15:44
  • Also: if your goal is simply to copy the content into an external file, why go through the trouble of converting to `String` at all? Simply write the `byte[]` directly to the file and let the final consumers of the data care about the encoding. – Joachim Sauer Sep 15 '21 at 15:46
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/237140/discussion-between-mahesh-and-joachim-sauer). – Mahesh Sep 15 '21 at 15:49
  • Inside the zip file, I saw the `__MACOSX` directory, and this directory contains `._*` files. If I remove the entire `__MACOSX` directory and upload the zip file again, it works. – Mahesh Sep 15 '21 at 16:02
  • Oh! Sorry, accidentally I posted a comment outside. – Mahesh Sep 15 '21 at 16:08
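
The suggestions in the comments above (skip the `__MACOSX` / `._*` metadata entries, and copy raw bytes instead of round-tripping through `String`) could be sketched roughly as follows. This is only an illustration, not a tested fix for the original application: it uses the JDK's `java.util.zip.ZipInputStream` rather than Zip4j so that it is self-contained, and the class name `ThreadDumpUnzipper` and its method names are hypothetical. The `isValidUtf8` helper is an optional strict-decoding check along the lines of the "fails to decode as UTF-8" remark; it does not detect an encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class; the names are illustrative, not from the question.
final class ThreadDumpUnzipper {

    /** True for macOS metadata entries: anything under __MACOSX/ or whose file name starts with "._". */
    static boolean isMacMetadata(String entryName) {
        String baseName = entryName.substring(entryName.lastIndexOf('/') + 1);
        return entryName.startsWith("__MACOSX/") || baseName.startsWith("._");
    }

    /** True if the bytes decode as UTF-8 with no malformed sequences (strict decoding). */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Copies the raw bytes of every regular, non-metadata entry to out,
     * without converting to String. Returns the number of entries copied.
     */
    static int copyEntries(InputStream zipStream, OutputStream out) throws IOException {
        int copied = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zis = new ZipInputStream(zipStream)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                if (entry.isDirectory() || isMacMetadata(entry.getName())) {
                    continue; // getNextEntry() closes the skipped entry for us
                }
                int n;
                while ((n = zis.read(buffer)) != -1) {
                    out.write(buffer, 0, n); // bytes are written untouched
                }
                copied++;
            }
        }
        return copied;
    }
}
```

In the servlet this would be called along the lines of `ThreadDumpUnzipper.copyEntries(iStream, new FileOutputStream("output_file.txt"))`, with the output stream closed by the caller. Whoever reads the output file later can then decide how to decode it (most likely as UTF-8).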

0 Answers