As far as I understand, a character encoding maps bits to integers and a character set maps integers to characters.

So in the Unicode character set there is a telephone character. It is represented using the integer 9742, more commonly represented using Hexadecimal as 260E. This is then saved to a file using UTF-8 which translates the integer 9742 into 10011000001110. Please correct me if I am wrong.

Yesterday I created a text file that used the Unicode character set and UTF-8 encoding and saved it to my desktop. I then reopened the file in my text editor and started to manually switch the character sets for fun. Unsurprisingly, there were problems and odd characters started displaying! I noticed that only some of the characters were misrepresented, though. This got me thinking: why do only some of the characters break? Why not all?

Someone told me that the characters that break are those outside the original ASCII specification. Upon reflection this seemed to make sense, as it's only non-US characters that break. I was told that because all character sets share the ASCII character set for the first 128 characters, those characters remain unbroken, and that it's the characters above 127 that break. Please correct me if I am wrong.
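
The effect described above can be reproduced with a short Python sketch. This is only an illustration, not the original experiment: the helper name `reinterpret` and the sample string are made up for the example, and `latin-1` and `cp1252` are simply two common encodings that agree with ASCII for byte values 0 to 127.

```python
# Illustrative sketch: encode once as UTF-8, then decode the same bytes
# with other codecs, roughly what an editor does when its encoding
# setting is switched on an existing file.

def reinterpret(text, wrong_codec, right_codec="utf-8"):
    """Encode with right_codec, then decode the bytes with wrong_codec."""
    raw = text.encode(right_codec)
    # errors="replace" substitutes U+FFFD for bytes the codec cannot map
    return raw.decode(wrong_codec, errors="replace")

text = "Call \u260e now"  # ASCII letters plus U+260E (telephone)

# In UTF-8 the telephone character becomes three bytes: E2 98 8E
print(text.encode("utf-8"))

for codec in ("utf-8", "latin-1", "cp1252"):
    print(codec, "->", repr(reinterpret(text, codec)))
```

The ASCII letters come back unchanged under all three codecs, while the three bytes of the telephone character are each read as a separate character by the single-byte codecs.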

Finally, I got to thinking: are there any character sets that don't respect ASCII? If so, what are they called and what are they used for?

  • 2
    Well UTF-16 for a start, where each BMP codepoint is two bytes, not one. Then EBCDIC... – Jon Skeet Mar 27 '17 at 12:01
  • 1
    Google "EBCDIC". – Paul R Mar 27 '17 at 12:01
  • 1
    See also [Baudot and ITA2](https://en.wikipedia.org/wiki/Baudot_code). – Paul R Mar 27 '17 at 12:15
  • 1
    BTW the character U+260E in UTF-8 is not 2 bytes but 3: https://mothereff.in/utf-8#%E2%98%8E. It's 2 bytes in UTF-16. – GOTO 0 Mar 27 '17 at 15:16
  • Although the history of technology and design choices is fascinating, the practical rules for character encodings are to communicate the encoding used for writing text and to expect the knowledge of which encoding to use for reading to come in advance of or with the encoded text. That makes any similarity between encodings moot. – Tom Blodget Mar 27 '17 at 17:17
  • 1
    Also see [cp1026](http://www.kreativekorp.com/charset/encoding.php?name=CP1026). I once ran into this bugger as the charset of an email, and it broke my email parser because it uses `0x25` instead of `0xA` for `LF`, but it also has a bunch of other characters in the `0x00 - 0x7F` range that are different than ASCII. – Remy Lebeau Mar 31 '17 at 00:43

1 Answer

Based on my findings from the comments I am able to answer my own question. Thank you to everyone who commented!

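Yes, there are a few. EBCDIC (used on IBM mainframes) and the Baudot code (ITA2) assign different values to characters than ASCII does. UTF-16 is also not byte-compatible with ASCII, because each BMP code point takes two bytes rather than one.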
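
As a rough check, the following Python sketch tests whether a codec produces the same bytes as ASCII for all 128 ASCII characters. The function name `is_ascii_compatible` is hypothetical; `cp500` and `cp1026` are EBCDIC code pages that ship with Python's standard codecs, and I am assuming they are representative of EBCDIC in general. Baudot is not included because Python has no built-in codec for it.

```python
# Illustrative sketch: compare a codec's output with ASCII's for every
# one of the 128 ASCII characters.

ASCII_TEXT = bytes(range(128)).decode("ascii")  # all 128 ASCII characters

def is_ascii_compatible(codec):
    """True if the codec encodes every ASCII character to its ASCII byte."""
    try:
        return ASCII_TEXT.encode(codec) == ASCII_TEXT.encode("ascii")
    except UnicodeEncodeError:
        # The codec cannot represent some ASCII character at all
        return False

for codec in ("utf-8", "latin-1", "cp1252", "cp500", "cp1026", "utf-16-le"):
    sample = "A".encode(codec).hex()
    print(f"{codec:10} 'A' -> {sample:6} ASCII-compatible: {is_ascii_compatible(codec)}")
```

UTF-8, Latin-1 and Windows-1252 pass the check, while the EBCDIC code pages and UTF-16 do not, which matches the observation that only characters above 127 break when switching between ASCII-compatible encodings.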

  • EBCDIC, been there recently - just be glad you don't have to deal with IBM mainframe files :) Btw, what was the answer? Many character sets seem to "allow" ASCII compatibility in the lower byte range. Did your answer touch on this point? What about file headers? Anything else you can share? – vikingsteve Apr 19 '17 at 12:01