I have been given a piece of text representing HTML, e.g.:
<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n
From the HTML <meta> tag I can see that the piece of HTML should be encoded as Windows-1252.
I am using node.js to parse this piece of text with cheerio. However, decoding it with https://github.com/mathiasbynens/windows-1252 is not helping: windows1252.decode(myString); gives back the same input string.
I think the reason is that the input string is already encoded in the standard node.js charset, but it actually represents a windows-1252 encoded piece of HTML (if that makes sense?).
Checking those strange hex numbers prepended by = I can see valid windows-1252 codes, e.g.:
- =\r\n and \r\n should somehow represent a carriage return in the Windows world,
- =3D: hex 3D is decimal 61, which is an equals sign: =,
- =96: hex 96 is decimal 150, which is an 'en dash' sign: – (some sort of "long minus symbol"),
- =A3: hex A3 is decimal 163, which is a pound sign: £
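As a quick sanity check of the hex-to-decimal values listed above, here is a minimal sketch using only node.js built-ins. The variable names are purely illustrative.

```javascript
// Convert each two-digit hex escape (the part after "=") to its decimal byte value.
const escapes = ['3D', '96', 'A3'];
const byteValues = escapes.map((hex) => parseInt(hex, 16));
console.log(byteValues); // [ 61, 150, 163 ]

// 0x3D is plain ASCII, so it is the same character in any ASCII-compatible charset.
const equalsSign = String.fromCharCode(byteValues[0]);
console.log(equalsSign); // =
```

Only the first value is below 0x80; the other two are single bytes whose meaning depends on the charset used to interpret them.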
I don't have control over the generation of that piece of HTML, but I am supposed to parse it and clean it, giving back £ (instead of =A3), etc.
Now, I know I could keep an in-memory map with the conversions, but I was wondering if there is already a programmatic solution that covers the whole windows-1252 charset?
Cf. this for the whole conversion table: https://www.w3schools.com/charsets/ref_html_ansi.asp
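On the in-memory map idea: windows-1252 differs from ISO-8859-1 only in the byte range 0x80 to 0x9F, so such a map would only need that range rather than the whole table. The following is a dependency-free sketch under that assumption, not a replacement for a maintained library. The names CP1252_HIGH and decodeWindows1252 are hypothetical, and bytes that windows-1252 leaves undefined are passed through as the same-numbered control character.

```javascript
// Code points for bytes 0x80..0x9F; null marks bytes windows-1252 leaves undefined.
const CP1252_HIGH = [
  0x20ac, null, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, null, 0x017d, null,
  null, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, null, 0x017e, 0x0178,
];

// Hypothetical helper: maps raw windows-1252 bytes to a JavaScript string.
function decodeWindows1252(bytes) {
  let text = '';
  for (const byte of bytes) {
    const inHighRange = byte >= 0x80 && byte <= 0x9f;
    // Outside 0x80..0x9F the byte value equals the Unicode code point.
    const codePoint = inHighRange ? (CP1252_HIGH[byte - 0x80] ?? byte) : byte;
    text += String.fromCharCode(codePoint);
  }
  return text;
}

// The raw bytes 0xA3, space, 0x96 (the values behind =A3 and =96).
const rawBytes = Buffer.from([0xa3, 0x20, 0x96]);
const decodedText = decodeWindows1252(rawBytes);
console.log(decodedText); // £ –
console.log(decodedText.charCodeAt(2).toString(16)); // 2013
```

The important design point is that the input is a sequence of bytes, not an already-decoded string.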
Edit:
The input HTML comes from an IMAP session, so it seems there's a 7bit/8bit "quoted printable encoding" going on upstream that I cannot control (cf. https://en.wikipedia.org/wiki/Quoted-printable).
In the meantime I became aware of this extra encoding and tried the quoted-printable library (cf. https://github.com/mathiasbynens/quoted-printable), with no luck.
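To make explicit what quoted-printable decoding involves, here is a minimal sketch (not the library's implementation, and not a complete RFC 2045 decoder). It removes soft line breaks (an = directly followed by a line break) and turns each =XX escape into one raw byte. The function name decodeQuotedPrintableToBytes is hypothetical, and the sketch assumes the remaining characters are plain single-byte text.

```javascript
// Hypothetical helper: decodes quoted-printable text into raw bytes.
function decodeQuotedPrintableToBytes(qpText) {
  // A trailing "=" before a line break is a soft line break: drop both.
  const unfolded = qpText.replace(/=\r?\n/g, '');
  const bytes = [];
  for (let i = 0; i < unfolded.length; i++) {
    const hex = unfolded.slice(i + 1, i + 3);
    if (unfolded[i] === '=' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
      bytes.push(parseInt(hex, 16)); // "=A3" becomes the single byte 0xA3
      i += 2; // skip the two hex digits just consumed
    } else {
      bytes.push(unfolded.charCodeAt(i) & 0xff); // assumes single-byte input
    }
  }
  return Buffer.from(bytes);
}

const sampleBytes = decodeQuotedPrintableToBytes('charset=3DWindows-1=\r\n252 =A3 =96');
console.log(sampleBytes.subarray(-3)); // <Buffer a3 20 96>
```

The result is a sequence of bytes rather than readable text: those bytes still have to be interpreted with the charset declared in the <meta> tag (windows-1252 here) before characters such as £ appear.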
The following is an MCVE (as per request):
var cheerio = require('cheerio');
var windows1252 = require('windows-1252');
var quotedPrintable = require('quoted-printable');
const inputString = '<html>\r\n<head>\r\n<meta http-equiv=3D\"Content-Type\" content=3D\"text/html; charset=3DWindows-1=\r\n252\">\r\n<style type=3D\"text/css\" style=3D\"display:none;\"><!-- P {margin-top:0;margi=\r\nn-bottom:0;} --></style>\r\n</head>\r\n<body dir=3D\"ltr\">This should be a pound sign: =A3 and this should be a long dash: =96 \r\n</body>\r\n</html>\r\n'
const $ = cheerio.load(inputString, {decodeEntities: true});
const bodyContent = $('html body').text().trim();
const decodedBodyContent = windows1252.decode(bodyContent);
console.log(`The input string: "${bodyContent}"`);
console.log(`The output string: "${decodedBodyContent}"`);
if (bodyContent === decodedBodyContent) {
  console.log('The windows1252 output seems the same of as the input');
}
const decodedQp = quotedPrintable.decode(bodyContent)
console.log(`The decoded QP string: "${decodedQp}"`);
The script above produces the following output:
The input string: "This should be a pound sign: =A3 and this should be a long dash: =96"
The output string: "This should be a pound sign: =A3 and this should be a long dash: =96"
The windows1252 output seems the same of as the input
The decoded QP string: "This should be a pound sign: £ and this should be a long dash: "
On my command line I cannot see the long dash. How can I properly decode all these =<something> encoded characters?