I was trying to print out the HTML text for https://top.baidu.com and https://www.qq.com, which both use GB2312 character encoding. It prints normally to the console except for the Chinese characters, which come out as unreadable text like ��㿴���ģ�ȫ�й��...
However, the Chinese characters come out just fine when I change the address to https://www.sina.com.cn or https://world.taobao.com, both of which use UTF-8.
Other than nicely asking Baidu and QQ to switch to UTF-8, is there anything I can do about this? Here is my code:
try {
    String address1 = "https://top.baidu.com"; // unreadable
    String address2 = "https://www.qq.com"; // also unreadable
    String address3 = "https://www.sina.com.cn"; // readable
    String address4 = "https://world.taobao.com"; // readable, too
    URL url = new URL(address1);
    StringBuilder htmlText = new StringBuilder();
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    InputStream stream = connection.getInputStream();
    InputStreamReader reader = new InputStreamReader(stream);
    int data = reader.read();
    while (data != -1) {
        char current = (char) data;
        htmlText.append(current);
        data = reader.read();
    }
    System.out.println(htmlText);
} catch (Exception e) {
    e.printStackTrace();
}
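
For comparison, here is a minimal sketch of the direction that seems most likely to help. It assumes the cause is that `new InputStreamReader(stream)` decodes with the platform default charset rather than GB2312. The class name `CharsetAwareFetch` and the helpers `charsetFromContentType` and `fetch` are hypothetical names made up for this illustration. The sketch reads the charset from the `Content-Type` response header and falls back to a caller-supplied charset when the header does not name one (for example, when a page declares its encoding only in a `<meta>` tag).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.util.Locale;

// Hypothetical helper class for illustration; not part of the original code.
final class CharsetAwareFetch {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=GB2312". Returns the fallback if the header is
    // missing, has no charset parameter, or names an unknown charset.
    public static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Downloads the page and decodes it with the charset from the response
    // header, or with the fallback (an assumption supplied by the caller).
    public static String fetch(String address, Charset fallback) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            Charset charset = charsetFromContentType(connection.getContentType(), fallback);
            StringBuilder htmlText = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), charset))) {
                char[] buffer = new char[4096];
                int n;
                while ((n = reader.read(buffer)) != -1) {
                    htmlText.append(buffer, 0, n);
                }
            }
            return htmlText.toString();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java CharsetAwareFetch <url>");
            return;
        }
        // GB2312 as the fallback is an assumption that fits the sites above.
        System.out.println(fetch(args[0], Charset.forName("GB2312")));
    }
}
```

Running it as `java CharsetAwareFetch https://top.baidu.com` would try the header first and GB2312 second. Even with correct decoding, the console itself has to be able to display Chinese characters, so the output may still look wrong in a terminal whose encoding cannot represent them.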