
I want to export a string (Chinese text) to a CSV file inside a zip file. Where do I need to set the encoding to UTF-8? Or what approach should I take (based on the code below) to display Chinese characters in the exported CSV file?

This is the code I currently have.

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8);
        try {
            ZipEntry entry = new ZipEntry("chinese.csv");
            zipOut.putNextEntry(entry);
            zipOut.write("类型".getBytes());
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            zipOut.close();
            out.close();
        }

Instead of "类型", I get "ç±»åž‹" in the CSV file.

  • There is a `getBytes(Charset charset)` method. See: https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#getBytes-java.nio.charset.Charset- – Arnaud Oct 31 '19 at 13:11
  • @Arnaud I tried `zipOut.write("类型".getBytes(UTF_8));` and it still does not work. – Isaia Bejan Oct 31 '19 at 13:38

2 Answers


The getBytes() method is one culprit: without an explicit charset it uses the default character set of your machine. From the Java String documentation:

getBytes()
Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

getBytes(String charsetName)
Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array.
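To see what the platform default actually is on a given machine, something like the following sketch may help. The class and method names are made up for illustration; the string literal uses Unicode escapes for "类型" so the check does not depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DefaultCharsetDemo {

    // True when the no-argument getBytes() yields the same bytes as UTF-8 on this JVM.
    static boolean defaultMatchesUtf8(String s) {
        return Arrays.equals(s.getBytes(), s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String s = "\u7c7b\u578b"; // 类型
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("default matches UTF-8: " + defaultMatchesUtf8(s));
    }
}
```

If the second line prints false, the no-argument getBytes() is not writing UTF-8 on that machine, and the explicit charset variant is needed.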

Furthermore, as @Slaw pointed out, make sure that you compile your source files with the encoding they are actually saved in (javac -encoding <encoding>). From the javac documentation:

-encoding Set the source file encoding name, such as EUC-JP and UTF-8. If -encoding is not specified, the platform default converter is used.
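If controlling the compiler flag is inconvenient, one possible way to take the source file encoding out of the picture while debugging is to write the literal with Unicode escapes, which are plain ASCII. A small sketch (the class and constant names are only illustrative):

```java
final class EscapedLiteral {

    // The same two characters as "类型", written with Unicode escapes so the
    // literal contains only ASCII and survives any source file encoding.
    static final String TYPE_LABEL = "\u7c7b\u578b";

    public static void main(String[] args) {
        System.out.println(TYPE_LABEL.length());                       // 2
        System.out.println(Integer.toHexString(TYPE_LABEL.charAt(0))); // 7c7b
        System.out.println(Integer.toHexString(TYPE_LABEL.charAt(1))); // 578b
    }
}
```

If the escaped version ends up correct in the output file and the plain literal does not, the source or compiler encoding is the likely cause.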

By the way, a call to closeEntry() was missing in the original post. I stripped the snippet down to what I found necessary to achieve the desired functionality.

    try (FileOutputStream fileOut = new FileOutputStream("out.zip");
         ZipOutputStream zipOut = new ZipOutputStream(fileOut)) {
        zipOut.putNextEntry(new ZipEntry("chinese.csv"));
        zipOut.write("类型".getBytes("UTF-8"));
        zipOut.closeEntry();
    }
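To check the whole round trip without involving an editor at all, a self-contained variant can write the zip into memory and read the entry back. This is only a sketch: the class and method names are invented here, and it uses StandardCharsets.UTF_8 instead of the "UTF-8" name so no UnsupportedEncodingException is involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCsvRoundTrip {

    // Writes one CSV entry into an in-memory zip, encoding the text as UTF-8.
    static byte[] zipCsv(String entryName, String csvText) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The charset passed to the constructor applies to entry names and
        // comments only, not to the bytes written into the entry.
        try (ZipOutputStream zipOut = new ZipOutputStream(out, StandardCharsets.UTF_8)) {
            zipOut.putNextEntry(new ZipEntry(entryName));
            zipOut.write(csvText.getBytes(StandardCharsets.UTF_8));
            zipOut.closeEntry();
        }
        return out.toByteArray();
    }

    // Reads the first entry back and decodes its content as UTF-8.
    static String readFirstEntry(byte[] zipBytes) throws IOException {
        try (ZipInputStream zipIn = new ZipInputStream(
                new ByteArrayInputStream(zipBytes), StandardCharsets.UTF_8)) {
            if (zipIn.getNextEntry() == null) {
                throw new IOException("zip archive has no entries");
            }
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = zipIn.read(buffer)) != -1) {
                content.write(buffer, 0, n);
            }
            return new String(content.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "\u7c7b\u578b"; // 类型, escaped so the source encoding cannot interfere
        byte[] zipBytes = zipCsv("chinese.csv", text);
        System.out.println(text.equals(readFirstEntry(zipBytes))); // true
    }
}
```

If this prints true but the CSV still looks wrong in a viewer, the bytes in the archive are fine and the viewer's charset setting is the more likely suspect.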

Finally, as @MichaelGantman pointed out, you might want to check which bytes are actually in the file, using a hex editor for example. That also rules out the possibility that the editor you view the result in is displaying correct UTF-8 the wrong way. In UTF-8, "类" is (hex) e7 b1 bb; in UTF-16 (the encoding Java uses internally for strings) it is 7c 7b.
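Without a hex editor at hand, the same byte values can be printed from Java. A minimal sketch, with a hypothetical helper toHex and UTF_16BE chosen so that no byte order mark is added:

```java
import java.nio.charset.StandardCharsets;

final class HexDump {

    // Formats each byte as two lowercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u7c7b"; // 类
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8)));    // e7 b1 bb
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 7c 7b
    }
}
```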

Curiosa Globunznik
  • I tried `zipOut.write("类型".getBytes(UTF_8));` and it still does not work. – Isaia Bejan Oct 31 '19 at 13:39
  • @IsaiaBejan Is your source file saved _and compiled_ using UTF-8 encoding? – Slaw Oct 31 '19 at 14:20
  • @curiosa It's a string literal. In this case, if you don't save the source file using UTF-8 then the characters are not saved correctly. Then you have to tell `javac` to use UTF-8, otherwise it won't read the characters correctly when compiling the code (e.g. `javac -encoding UTF-8 ...`). – Slaw Oct 31 '19 at 14:22
  • thanks for the answer, apparently it was a problem with the editor, in notepad it was displayed correctly. – Isaia Bejan Oct 31 '19 at 17:52

First, you definitely need to change zipOut.write("类型".getBytes()); to zipOut.write("类型".getBytes(StandardCharsets.UTF_8));

Also, when you open the resulting CSV file, the editor might not be aware that the content is encoded in UTF-8, so you may need to tell it. For instance, in Notepad you can use the "Save As" option and change the encoding to UTF-8. Your issue might therefore be just a display problem rather than an actual encoding problem.

There is an open-source Java library with a utility that converts any String to a Unicode sequence and vice versa. This utility has helped me many times when diagnosing charset-related issues. Here is a sample of what the code does:

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

The library can be found at Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.

Here is the Javadoc for the class StringUnicodeEncoderDecoder.

I tried your inputs and got this:

System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("类型"));
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("ç±»åž‹"));

And the output was:

\u7c7b\u578b
\u00e7\u00b1\u00bb\u00e5\u017e\u2039

So it looks like you did lose the info, and it is not just a display issue.
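For what it's worth, the second escape sequence is exactly what the UTF-8 bytes of "类型" look like when they are decoded as windows-1252, so one possible reading is that the bytes themselves are intact and only the charset used to read them is off. The sketch below reproduces that using only the JDK. It assumes windows-1252 was the charset involved, which the thread does not confirm, and all class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Assumption: the viewer decoded the file as windows-1252.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Renders every char as a four-digit lowercase Unicode escape.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // Decodes correct UTF-8 bytes with the wrong charset, as a misconfigured viewer would.
    static String misread(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), WINDOWS_1252);
    }

    // Reverses misread(): re-encode with the wrong charset, then decode as UTF-8.
    // This works for this input because every byte involved has a windows-1252
    // mapping; it is not a general-purpose repair.
    static String repair(String garbled) {
        return new String(garbled.getBytes(WINDOWS_1252), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u7c7b\u578b"; // 类型
        String garbled = misread(original);
        System.out.println(escape(garbled));                  // the six escapes shown above
        System.out.println(original.equals(repair(garbled))); // true
    }
}
```

Checking the raw bytes of the extracted CSV would settle which of the two readings applies.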

Michael Gantman