I was trying to print out the HTML text for https://top.baidu.com and https://www.qq.com, which both use GB2312 character encoding. It prints normally to the console except for the Chinese characters, which come out as unreadable text like ��㿴���ģ�ȫ�й��...

However, the Chinese characters come out just fine when I change the address to https://www.sina.com.cn or https://world.taobao.com, both of which use UTF-8.

Other than nicely asking Baidu and QQ to switch to UTF-8, is there anything I can do about this? Here is my code.

    try {
        String address1 = "https://top.baidu.com"; //unreadable
        String address2 = "https://www.qq.com"; //also unreadable
        String address3 = "https://www.sina.com.cn"; //readable
        String address4 = "https://world.taobao.com"; //readable, too

        URL url = new URL(address1);
        StringBuilder htmlText = new StringBuilder();
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        InputStream stream = connection.getInputStream();
        InputStreamReader reader = new InputStreamReader(stream);
        int data = reader.read();

        while (data != -1) {
            char current = (char) data;
            htmlText.append(current);
            data = reader.read();
        }
        System.out.println(htmlText);

    } catch (Exception e) {
        e.printStackTrace();
    }
K Man

`new InputStreamReader(stream)`: the [Javadoc](https://docs.oracle.com/javase/8/docs/api/java/io/InputStreamReader.html#InputStreamReader-java.io.InputStream-) says: *Creates an InputStreamReader that uses the **default charset**.* That is not the character set specified in the `content-type: text/html; charset=GB2312` HTTP header of the response. – Andreas Apr 30 '20 at 00:20

1 Answer

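After reading Andreas's comment, I looked up the `InputStreamReader` constructor that takes an explicit charset and replaced the reader line with the following, so the response bytes are decoded as GB2312 instead of the platform default:

    InputStreamReader reader = new InputStreamReader(stream, Charset.forName("GB2312"));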
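
Hard-coding GB2312 works for these two sites, but the same code would then mis-decode the UTF-8 pages. One way to avoid that, following Andreas's comment, is to read the charset from the response's `Content-Type` header. The following is a minimal sketch, not a definitive implementation: the class name `CharsetSniffer`, the helper names `charsetFromContentType` and `fetch`, and the UTF-8 fallback are all assumptions made for illustration. It also assumes the server actually sends a `charset` parameter; pages that declare their encoding only in a `<meta>` tag would need extra handling.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names here are illustrative, not from the question.
final class CharsetSniffer {

    // Matches the charset parameter of a Content-Type value, e.g. "text/html; charset=GB2312".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header, or the fallback if it is missing or unknown.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher matcher = CHARSET_PARAM.matcher(contentType);
        if (!matcher.find()) {
            return fallback;
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    // Downloads a page and decodes it with the charset the server declared.
    static String fetch(String address) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        try {
            // Assumption: fall back to UTF-8 when the header carries no usable charset.
            Charset charset = charsetFromContentType(
                    connection.getContentType(), StandardCharsets.UTF_8);
            StringBuilder html = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), charset))) {
                char[] buffer = new char[8192];
                int count;
                while ((count = reader.read(buffer)) != -1) {
                    html.append(buffer, 0, count);
                }
            }
            return html.toString();
        } finally {
            connection.disconnect();
        }
    }
}
```

With this in place, `System.out.println(CharsetSniffer.fetch(address1));` would replace the read loop in the question. Whether the console then displays the characters correctly still depends on the console's own encoding.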
K Man