
I am currently testing CSV file output in Shift-JIS format, but the results differ in a strange way between different Japanese characters, as shown below:

My code :

try {
        String dat2 = "ｶﾖ ﾊﾗﾀﾞ";
        String dat = "２バイト文字出力";
        String fileName = "C:/Users/CR/Desktop/test2.txt";

        FileOutputStream fos = new FileOutputStream(fileName);
        OutputStreamWriter osw = new OutputStreamWriter(fos, "Shift_JIS");
        BufferedWriter fp = new BufferedWriter(osw);

        fp.write(new String(dat2.getBytes("Shift_JIS")));
        fp.newLine();

        fp.flush();
        fp.close();
        fos.close();

    } catch (Exception ex) {
        throw new Exception(ex);
    }

Result for dat2:

[Screenshot: the output file opened in Notepad++, with "ANSI" shown in the status bar]

The file was not detected as Shift-JIS and the characters were wrong too. In another trial:

Result for dat:

[Screenshot: the output file opened in Notepad++, detected as Shift-JIS]

This one is displayed correctly and in the expected format.

Did something go wrong, or is the content itself not correct?

Thanks!

crchin
  • In future, you may want to open the file in a Japanese editor like Sakura, as Notepad++ has a tendency to ignore encodings and do whatever the hell it likes with files containing Japanese characters. Even changing the settings to assume UTF-8 / SJIS still has it choose the wrong one most of the time (at least for me), and this led to mojibake (文字化け, garbled text) similar to your post appearing in Notepad++ but not in Sakura. – The Wandering Coder Dec 12 '16 at 05:33

3 Answers


Most of your code is good except for the line:

    fp.write(new String(dat2.getBytes("Shift_JIS")));

Java strings are (more or less) encoding neutral. The encoding comes into play when you write the string to a file (or send it over the net). In your case, the encoding conversion is handled by the OutputStreamWriter you have set up correctly.

So the line becomes simpler:

    fp.write(dat2);
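
Putting that together, a minimal sketch of the corrected write might look like the following. The class name `ShiftJisWriter` and the helper `writeLine` are made up for illustration and are not from the question; the half-width katakana of dat2 is written with Unicode escapes so that the encoding of the Java source file itself does not matter.

```java
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, not part of the original post.
final class ShiftJisWriter {
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Writes one line of text; the OutputStreamWriter performs the only encoding step.
    static void writeLine(Path file, String text) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file.toFile()), SHIFT_JIS))) {
            writer.write(text); // pass the String as-is, no getBytes()/new String() round trip
            writer.newLine();
        } // try-with-resources flushes and closes the whole stream chain
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("shift-jis-demo", ".csv");
        // The half-width katakana string from the question (dat2), as Unicode escapes.
        writeLine(file, "\uFF76\uFF96 \uFF8A\uFF97\uFF80\uFF9E");
        // Dump the raw bytes so the check does not depend on an editor's encoding guess.
        for (byte b : Files.readAllBytes(file)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Looking at the raw bytes rather than at an editor window separates "the file is wrong" from "the editor guessed the wrong encoding".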

BTW:

The expression

    new String(dat2.getBytes("Shift_JIS"))

first converts the string dat2 into a byte array in Shift_JIS encoding and then converts the byte array into a string using the default encoding (probably UTF-8), thereby decoding the byte array using the wrong encoding.
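
To see the effect in isolation, here is a small sketch that makes the decoding charset explicit instead of relying on the platform default. `RoundTripDemo` and `roundTrip` are illustrative names, and UTF-8 stands in for the default encoding purely as an assumption; the actual default on the asker's machine is not known.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative class, not from the original post.
final class RoundTripDemo {
    // Encodes text with one charset and decodes the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        Charset sjis = Charset.forName("Shift_JIS");
        // The half-width katakana string from the question, as Unicode escapes.
        String original = "\uFF76\uFF96 \uFF8A\uFF97\uFF80\uFF9E";

        // Same charset in both directions: the string survives.
        System.out.println(original.equals(roundTrip(original, sjis, sjis)));

        // Shift_JIS bytes decoded as UTF-8 (assumed default): the string is damaged.
        System.out.println(original.equals(roundTrip(original, sjis, StandardCharsets.UTF_8)));
    }
}
```

Once the bytes have been decoded with the wrong charset, the damaged characters are what the writer receives, so the OutputStreamWriter cannot repair them afterwards.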

P.S.

One more thing. Text files like CSV files have no way to indicate what encoding was used to write them (exception: UTF with BOM). There are only heuristics to make a good guess. So when you open them in a text editor, you have to check whether they were opened with the correct encoding and fix it if necessary. In your first screenshot, it says "ANSI" in the status bar. That's hardly what you want.

Codo
  • Hi Codo, thanks for pointing that out, but unfortunately the result is still the same. :( – crchin Aug 24 '16 at 11:12
  • Have you opened the file with Shift-JIS encoding in the text editor? Or is it still using ANSI encoding? – Codo Aug 24 '16 at 11:15
  • By default, the file with dat is opened in npp and it is automatically shown as Shift-JIS. Both files should behave the same, shouldn't they? – crchin Aug 24 '16 at 11:19
  • Look at your screenshots: one file is opened with ANSI, one with Shift-JIS. And see my P.S. addition in the answer. – Codo Aug 24 '16 at 11:21
  • I don't think the *default* is relevant here. Your *dat2* file was opened and displayed with ANSI (as the screenshot proves). – Codo Aug 24 '16 at 11:30
  • I mean both tests used the same implementation to generate the CSV file, but somehow one is shown as Shift-JIS and the other as ANSI. I thought both files should show Shift-JIS once opened in the text editor. – crchin Aug 24 '16 at 11:42

It seems the issue is caused by the Japanese text itself: full-width versus half-width katakana characters.

In the sample given above, dat is full-width and dat2 is half-width.

So I tried using ICU4J to convert half-width to full-width, and then the text could be written to the CSV in Shift-JIS format successfully.

    Transliterator transliterator = Transliterator.getInstance("Halfwidth-Fullwidth");
    String converted = transliterator.transliterate("ｶﾖ ﾊﾗﾀﾞ");

The result is as below:
カヨ ハラダ
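
If adding ICU4J is not an option, a similar width conversion can be sketched with the JDK's own `java.text.Normalizer`. This is a different technique from the ICU4J transliterator above, not a drop-in replacement: NFKC normalization also folds other compatibility characters (for example, full-width Latin letters and digits become plain ASCII), so it is only an approximation under that assumption. `WidthNormalizer` and `toFullWidthKatakana` are made-up names.

```java
import java.text.Normalizer;

// Illustrative class, not from the original post.
final class WidthNormalizer {
    // NFKC maps half-width katakana to full-width and composes a base kana
    // with a following voiced sound mark into a single character.
    static String toFullWidthKatakana(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // The half-width katakana string from the question, as Unicode escapes.
        String halfWidth = "\uFF76\uFF96 \uFF8A\uFF97\uFF80\uFF9E";
        String fullWidth = toFullWidthKatakana(halfWidth);
        // Print the code points so the result does not depend on console encoding.
        fullWidth.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}
```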
crchin

I have run the program below:

import java.io.*;

public class Hoge {
    public static void main(String[] args) {
        try {
            {
                String dat = "2バイト文字出力";
                String fileName = "./FullWidth.txt";

                FileOutputStream fos = new FileOutputStream(fileName);
                OutputStreamWriter osw = new OutputStreamWriter(fos, "Shift_JIS");
                BufferedWriter fp = new BufferedWriter(osw);

                fp.write(new String(dat.getBytes("Shift_JIS")));
                fp.newLine();

                fp.flush();
                fp.close();
                fos.close();
            }
            {
                String dat2 = "カヨ ハラダ";
                String fileName = "./HalfWidth.txt";

                FileOutputStream fos = new FileOutputStream(fileName);
                OutputStreamWriter osw = new OutputStreamWriter(fos, "Shift_JIS");
                BufferedWriter fp = new BufferedWriter(osw);

                fp.write(new String(dat2.getBytes("Shift_JIS")));
                fp.newLine();

                fp.flush();
                fp.close();
                fos.close();
            }
        } catch (Exception ex) {
            // NOP
        }
    }
}

The content of FullWidth.txt is (in hex):

3F 51 3F 6F 3F 43 3F 67 3F 3F 3F 3F 3F 6F 3F 3F 0A

The string ２バイト in Shift JIS encoding should be 82 51 83 6F 83 43 83 67. So I think Notepad++ recognized the encoding as Shift JIS and somehow recovered the first byte of each character.

On the other hand, the content of HalfWidth.txt is (in hex):

3F 3F 20 3F 3F 3F 3F 0A

So I think Notepad++ could not recognize the encoding of this file.

In short: both files are wrong. By accident, Notepad++ was able to recover the content of one file but not the content of the other.
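
The hex dumps above can be reproduced without touching the file system. The sketch below assumes the platform default charset was one that cannot decode bytes above 0x7F, with US-ASCII used as a stand-in; that is a guess, not something stated here. `MangleDemo`, `mangle` and `toHex` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative class, not from the original post.
final class MangleDemo {
    private static final Charset SHIFT_JIS = Charset.forName("Shift_JIS");

    // Mimics new String(text.getBytes("Shift_JIS")) followed by a Shift_JIS writer,
    // with the platform default charset passed in explicitly (an assumption).
    static byte[] mangle(String text, Charset assumedDefault) {
        String decoded = new String(text.getBytes(SHIFT_JIS), assumedDefault);
        // Characters that Shift_JIS cannot represent come out as '?' (0x3F).
        return decoded.getBytes(SHIFT_JIS);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First four characters of dat (full-width) and all of dat2 (half-width).
        String fullWidth = "\uFF12\u30D0\u30A4\u30C8";
        String halfWidth = "\uFF76\uFF96 \uFF8A\uFF97\uFF80\uFF9E";
        System.out.println(toHex(mangle(fullWidth, StandardCharsets.US_ASCII)));
        System.out.println(toHex(mangle(halfWidth, StandardCharsets.US_ASCII)));
    }
}
```

Under that assumption, every byte above 0x7F is lost, while trail bytes that happen to fall in the ASCII range (51, 6F, 43, 67) survive, which matches the pattern in FullWidth.txt.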

SATO Yusuke