KRL RSS parser: Handle encoding issues?

Question

I'm importing an RSS feed from Tumblr into a Kynetx app. It appears that the RSS feed has some encoding issues, as apostrophes appear like this:

[Image: apostrophes encoded incorrectly]

The feed (which you can find here) claims to be encoded in UTF-8.

Is there a way to specify the encoding or else replace those characters with regular apostrophes?

That's good UTF-8. (It's the [right single quote](http://www.fileformat.info/info/unicode/char/2019/index.htm), not a standard apostrophe.) Your client is parsing the feed as if it were in the default charset, not as if it were UTF-8. — dkarp, Jan 19 '11 at 23:45
I think you mean, how do you get KRL to parse this feed as UTF-8 instead of parsing it as windows-1252? And I have no idea what the answer to that is. — dkarp, Jan 20 '11 at 02:06

score 2 · Accepted Answer · answered Jan 20 '11 at 00:16

While not optimal, you could try to catch these mis-decoded sequences and replace them with a plain apostrophe:

newstring = oldstring.replace(re/â€™/\'/);

[Image: Windows special chars]

This appears to be a case of a service that specifies UTF-8 but doesn't explicitly enforce it. I uploaded an image of the RSS feed that you provided. For comparison, I cut and pasted the text into a Notepad document and then typed in the same text from my keyboard.

I don't know if you can tell from the image, but the apostrophe that is mangled is different from the apostrophe that is generated by my UTF-8 browser.

I suspect that this post was submitted via a Windows client. If you look at your encoding options, you will see an option for Western (Windows-1252).

Windows-1252 is a legacy Windows encoding that resembles ISO 8859-1 but assigns printable characters (curly quotes among them) to the 0x80-0x9F range, which ISO 8859-1 reserves for control characters.

A couple of quotes from the Wikipedia page cited above:

It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling

Many Microsoft programs, such as Word will automatically substitute Windows-1252 characters when standard ASCII characters are entered, such as for "smart quotes" (e.g. substituting ’ for the apostrophe in a contraction) or substituting © for the three characters '(c)'.
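To make the difference between the two encodings concrete, here is a small Python sketch (Python rather than KRL, purely for illustration; the variable names are mine) that decodes the same byte, 0x92, both ways:

```python
import unicodedata

# The single byte Windows-1252 uses for the "smart" apostrophe.
raw = b"\x92"

# Decoded as Windows-1252, 0x92 is U+2019 RIGHT SINGLE QUOTATION MARK.
as_cp1252 = raw.decode("windows-1252")

# Decoded as ISO 8859-1, the same byte is the C1 control character U+0092.
as_latin1 = raw.decode("iso-8859-1")

print(unicodedata.name(as_cp1252))      # RIGHT SINGLE QUOTATION MARK
print(unicodedata.category(as_latin1))  # Cc (control character)
```

This is why text labeled ISO-8859-1 but actually written as Windows-1252 usually gets treated as Windows-1252: the control-character reading is almost never what the author meant.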

KRL supports every character that UTF-8 can encode, so it handles multi-byte international characters natively. The trade-off is that it cannot fudge encodings the way software can when it only has ISO-8859-1 and Windows-1252 to choose from.

The fact that the remote server is returning the three bytes `â€™` means that yes, the original post contained the magic Windows right-quote instead of an apostrophe... but also that it's been picked up properly. Those three bytes are proper UTF-8 for the [right single quote](http://www.fileformat.info/info/unicode/char/2019/index.htm) character: `0xE2 0x80 0x99`. But you see those 3 characters because instead of parsing the stream as UTF-8 (which it is!), KRL is parsing it as windows-1252. That's the problem. — dkarp, Jan 20 '11 at 02:08
I had to use this syntax to get it to compile: `oldstring.replace(re/â€™/, "'");`. But even with that, it didn't seem to work; the three characters don't get replaced. — Steve Nay, Jan 20 '11 at 03:59
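
The misreading dkarp describes can be reproduced outside KRL. Below is a minimal Python sketch, assuming the feed text contains U+2019 and that the consumer decodes the UTF-8 bytes as Windows-1252. It says nothing about how KRL stores strings internally, and the sample text is made up:

```python
# Sample text with U+2019 (right single quote), as Tumblr would emit it.
text = "It\u2019s"

# Correct UTF-8 encoding: the quote becomes the bytes E2 80 99.
utf8_bytes = text.encode("utf-8")

# Misreading those bytes as Windows-1252 yields three separate characters.
misread = utf8_bytes.decode("windows-1252")
print(misread)  # Itâ€™s

# If the misreading was exactly this, it can be reversed by re-encoding
# as Windows-1252 and decoding as UTF-8.
repaired = misread.encode("windows-1252").decode("utf-8")
print(repaired == text)  # True
```

The round trip only works when every misread character maps back to a byte. Python's Windows-1252 codec leaves a few bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so text containing those would raise an error rather than decode.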
