Dealing with multiple encoding schemes while downloading the XML feed

Question

I am trying to read the feed at the following URL:

http://www.chinanews.com/rss/scroll-news.xml

using the request module. But the output is garbled, for example: �� ʷ��)��(�й�)��޹�.

On reviewing the XML I see that the encoding is being set as <?xml version="1.0" encoding="gb2312"?>

But when I set the encoding to gb2312, I get an "unknown encoding" error.

request({
    url: "http://www.chinanews.com/rss/scroll-news.xml",
    method: "GET",
    headers: {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Host": "www.chinanews.com",
        "Accept-Language": "en-GB,en-US;q=0.8,en;q=0.6"
    },
    "gzip": true,
    "encoding": "utf8"
}, (err, resp, data) => {
    console.log(data);
});

Is there a way I could get the data irrespective of the encoding it has? How should I approach this?

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

You missed the concept of character encoding.

var iconv=require('iconv-lite'), request=require('request');
request({
    url: "http://www.chinanews.com/rss/scroll-news.xml",
    method: "GET",
    headers: {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Host": "www.chinanews.com",
        "Accept-Language": "" // client accept language
    },
    gzip: true,
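    encoding: null // return the body as a raw Buffer; do not let request decode it
}, (err, resp, body) => {
    if (err) throw err;
    // body is already a Buffer, so decode the GB2312 bytes directly
    console.log(iconv.decode(body, 'gb2312'));
});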

body is a Buffer instance in Node.js. According to the official documentation, the only character encodings that Node.js currently supports natively are:

'ascii' - For 7-bit ASCII data only. This encoding is fast and will strip the high bit if set.

'utf8' - Multibyte encoded Unicode characters. Many web pages and other document formats use UTF-8.

'utf16le' - 2 or 4 bytes, little-endian encoded Unicode characters. Surrogate pairs (U+10000 to U+10FFFF) are supported.

'ucs2' - Alias of 'utf16le'.

'base64' - Base64 encoding. When creating a Buffer from a string, this encoding will also correctly accept "URL and Filename Safe Alphabet" as specified in RFC4648, Section 5.

'latin1' - A way of encoding the Buffer into a one-byte encoded string (as defined by the IANA in RFC1345, page 63, to be the Latin-1 supplement block and C0/C1 control codes).

'binary' - Alias for 'latin1'.

'hex' - Encode each byte as two hexadecimal characters.

To use an encoding that Node.js does not support natively, use iconv, iconv-lite or another library that provides the character mapping table. This is very similar to this answer.
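To make the list above concrete, here is a small sketch using only the built-in Buffer API. The four sample bytes are an assumption chosen for illustration: they are the GB2312 encoding of the two characters "中国". It shows that Node.js rejects 'gb2312' as an encoding name, and why decoding such bytes as 'utf8' produces replacement characters like those in the question.

```javascript
// Assumed sample input: the GB2312 bytes for the two characters "中国".
const gb2312Bytes = Buffer.from([0xd6, 0xd0, 0xb9, 0xfa]);

// Node.js does not recognise 'gb2312' as a built-in encoding name.
const nativelySupported = Buffer.isEncoding('gb2312');

// Decoding as UTF-8 turns the invalid byte sequences into U+FFFD.
const asUtf8 = gb2312Bytes.toString('utf8');

// latin1 maps each byte to exactly one character, so the bytes survive
// a round trip even though the resulting text is not meaningful.
const roundTripped = Buffer.from(gb2312Bytes.toString('latin1'), 'latin1');

console.log(nativelySupported);
console.log(JSON.stringify(asUtf8));
console.log(roundTripped.equals(gb2312Bytes));
```

This is why the raw Buffer has to be kept intact (encoding: null) and handed to a library that knows the GB2312 mapping table.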

The Accept-Language header lists the languages the client accepts. en-GB means English (United Kingdom), not Chinese. The Chinese tags are zh-CN and zh (see RFC 7231).

Tom Blodget · Answer 2 · 2017-11-27T01:37:20.547

The tricky part is to pass encoding as null to get a Buffer instead of a string.

encoding - encoding to be used on setEncoding of response data. If null, the body is returned as a Buffer.

—request

var request = require('request');
var legacy = require('legacy-encoding');


var requestSettings = {
    method: 'GET',
    url: 'http://www.chinanews.com/rss/scroll-news.xml',
    encoding: null,
};

request(requestSettings, function(error, response, body) {
    if (error) throw error;
    // body is a Buffer because encoding is null
    var text = legacy.decode(body, 'gb2312');
    console.log(text);
});

Again, in the context of the follow-up question, "Is there a way I could detect encoding?"

By "detect" I hope you mean "find the declaration" (as opposed to guessing; if you have to guess, the communication has failed). The HTTP response header Content-Type is the primary way to communicate the encoding (if applicable to the MIME type). Some MIME types allow the encoding to be declared within the content, and servers quite rightly defer to that.
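As a rough illustration of looking for the declaration in the header, the sketch below pulls the charset parameter out of a Content-Type value. The function name charsetFromContentType is hypothetical, and the pattern is a simplification that does not cover every edge case of the header grammar.

```javascript
// Hypothetical helper: return the charset parameter of a Content-Type
// header value in lower case, or null when none is declared.
function charsetFromContentType(headerValue) {
    if (!headerValue) return null;
    // Accepts both charset=gb2312 and charset="gb2312".
    const match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(headerValue);
    return match ? match[1].toLowerCase() : null;
}

console.log(charsetFromContentType('text/xml; charset=GB2312'));
console.log(charsetFromContentType('text/xml'));
```

A null result means the header does not settle the question, and the declaration inside the content has to be consulted instead.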

In the case of your RSS response, the server sends Content-Type: text/xml, which carries no encoding override, and the content's XML declaration is <?xml version="1.0" encoding="gb2312"?>. The XML specification has procedures for finding such a declaration. It basically amounts to reading with different encodings until the XML declaration becomes intelligible, and then re-reading with the declared encoding.

var request = require('request');
var legacy = require('legacy-encoding');
var convert = require('xml-js');

// specials listed here: https://www.w3.org/Protocols/rfc1341/4_Content-Type.html
// (no .compile() call: it is deprecated and, called without arguments, resets the pattern)
var charsetFromContentTypeRegex = /charset=([^()<>@,;:\"/[\]?.=\s]*)/i;

var requestSettings = {
    method: 'GET',
    url: 'http://www.chinanews.com/rss/scroll-news.xml',
    encoding: null,
};


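request(requestSettings, function(error, response, body) {
    if (error) throw error;
    // exec() returns null when the header carries no charset parameter
    var match = charsetFromContentTypeRegex.exec(response.headers['content-type'] || '');
    var encodingFromHeader = match ? match[1] : null;

    // First pass: decode as latin1 (one character per byte) only to read the XML declaration
    var doc = convert.xml2js(body.toString('latin1'));
    var encoding = doc.declaration.attributes.encoding;
    // Second pass: decode with the header charset if present, otherwise the declared encoding
    doc = convert.xml2js(
        legacy.decode(body, encodingFromHeader ? encodingFromHeader : encoding));
    // xpath /rss/channel/title
    console.log(doc.elements[1].elements[0].elements[0].elements[0].text);
});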
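The first pass does not strictly need a full XML parser. The following is a minimal sketch, not the procedure from the XML specification: it assumes an ASCII-compatible encoding (so it will not find a declaration in UTF-16 content) and ignores byte order marks. The names declaredXmlEncoding and pickEncoding, and the 200-byte cap, are assumptions made for illustration.

```javascript
// Hypothetical helper: read the encoding pseudo-attribute of the XML
// declaration straight from the raw bytes, or return null if absent.
function declaredXmlEncoding(buffer) {
    // The declaration sits at the very start; 200 bytes is an assumed cap.
    const head = buffer.subarray(0, 200).toString('latin1');
    const match = /^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']/.exec(head);
    return match ? match[1].toLowerCase() : null;
}

// Hypothetical helper: the header charset wins, then the XML declaration,
// then UTF-8, which is the XML default when nothing is declared.
function pickEncoding(headerCharset, buffer) {
    return headerCharset || declaredXmlEncoding(buffer) || 'utf-8';
}

const sample = Buffer.from('<?xml version="1.0" encoding="gb2312"?><rss/>', 'latin1');
console.log(pickEncoding(null, sample));
```

The result can then be passed to legacy.decode or iconv.decode together with the original Buffer.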

Is there a way I could detect encoding? There are numerous pages that the script needs to parse and they all could have different encoding schemes. So, I will never know in advance about the encoding — Suhail Gupta, Nov 24 '17 at 06:56
