70

Was reading Joel Spolsky's 'The Absolute Minimum' about character encoding. It is my understanding that ASCII is a Code-point + Encoding scheme, and in modern times, we use Unicode as the Code-point scheme and UTF-8 as the Encoding scheme. Is this correct?

M. Abbas
Quest Monger
  • 2
    Historical + technical overview (fixed the confusion for me): [Characters, Symbols and the Unicode Miracle - Computerphile](https://www.youtube.com/watch?v=MijmeoH9LT4) – mshwf May 30 '20 at 17:38

3 Answers

74

In modern times, ASCII is a subset of UTF-8 rather than its own scheme; UTF-8 is backwards compatible with ASCII.
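The backwards compatibility is easy to check directly. A minimal sketch (Python, purely for illustration — the original answer contains no code):

```python
# Any pure-ASCII text encodes to the same bytes under ASCII and UTF-8,
# which is what "backwards compatible" means in practice.
text = "Hello, world!"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
print(ascii_bytes == utf8_bytes)   # True: identical byte sequences
print(list(ascii_bytes[:5]))       # [72, 101, 108, 108, 111]

# A non-ASCII character needs more than one byte in UTF-8.
print("é".encode("utf-8"))         # b'\xc3\xa9'
```

Any byte below 0x80 in a UTF-8 stream is the same character it would be in ASCII, which is why legacy ASCII files are already valid UTF-8.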

Remy Lebeau
  • 1
    Ok. Before UTF-8, was ASCII a combined code-point+encoding system? I only ask because I would like to learn how the ASCII system evolved. – Quest Monger Jan 23 '14 at 03:57
  • 1
    ASCII defines codepoint values (they were not called codepoints until Unicode came along) 0-127, but it does not define their encodings. All language encodings use the same values as ASCII for their first 128 characters. UTF-8, ISO encodings, Latin encodings, etc are all 8-bit encodings that support ASCII values. UTF-16 and UTF-32 are 16/32-bit encodings that also support ASCII values. Codepoint values and their encoded code-unit values within a given encoding are two separate things. – Remy Lebeau Jan 23 '14 at 05:02
  • 3
    Sort of. ASCII technically only defines the first 7 bits. But most ASCII + code page schemes have an extra 128 characters, such as Windows (1252) or Mac OS Roman (10000). These are all referred to as "ASCII", but UTF-8 doesn't match any of them if you go over 127. – PRMan Feb 06 '18 at 18:07
  • @PRMan those are all commonly referred to as ANSI encodings (even though they are not actually defined by ANSI), not as ASCII. Most devs understand that ASCII is just 7bits and so only covers characters 0-127, 128-255 are handled by ANSI, and beyond that is handled by Unicode. – Remy Lebeau Feb 06 '18 at 19:01
  • Look up ATASCII on Wikipedia. It is referred to as a "non-standard ASCII" for Atari 8-bit computers. The term "ANSI encoding" is not present in the article. But it is referred to as an ASCII, despite the article being mostly about the differences. Same at ascii-table.com, where ANSI is not mentioned, except as a search term at the bottom. In fact, ascii-table.com says ANSI is "a misnomer that continues to persist in the Windows community" – PRMan Jun 09 '19 at 00:09
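The codepoint-vs-code-unit distinction (and the code-page divergence above 127) discussed in these comments can be seen in a short sketch (Python here, purely for illustration):

```python
# One codepoint, different code-unit sequences depending on the encoding.
ch = "é"                           # codepoint U+00E9
print(f"U+{ord(ch):04X}")          # U+00E9
print(ch.encode("utf-8"))          # b'\xc3\xa9'          (two 8-bit code units)
print(ch.encode("utf-16-le"))      # b'\xe9\x00'          (one 16-bit code unit)
print(ch.encode("utf-32-le"))      # b'\xe9\x00\x00\x00'  (one 32-bit code unit)

# Beyond 127, a legacy "ANSI" code page and UTF-8 disagree:
print(ch.encode("cp1252"))         # b'\xe9' — one byte in Windows-1252
print(b"\xe9".decode("cp1252"))    # é
```

Below 128 every one of these encodings agrees with ASCII; the disagreements only begin at codepoint 128 and above, exactly as the comments describe.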
50

Yes, except that UTF-8 is an encoding scheme. Other encoding schemes include UTF-16 (with two different byte orders) and UTF-32. (Adding to the confusion, a UTF-16 scheme is simply called “Unicode” in Microsoft software.)
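The two UTF-16 byte orders can be seen directly; a small sketch (Python, for illustration only):

```python
# The same codepoints serialize to different bytes under UTF-16's two orders.
s = "Hi"
print(s.encode("utf-16-be"))   # b'\x00H\x00i' — big-endian
print(s.encode("utf-16-le"))   # b'H\x00i\x00' — little-endian

# The plain "utf-16" codec prepends a byte-order mark (BOM) so that a
# reader can tell which order was used (b'\xff\xfe' on little-endian builds).
bom = s.encode("utf-16")[:2]
print(bom in (b"\xff\xfe", b"\xfe\xff"))   # True
```

This is why "UTF-16" alone is ambiguous as a file format: without a BOM or out-of-band agreement, a decoder cannot know the byte order.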

And, to be exact, the American National Standard that defines ASCII specifies a collection of characters and their coding as 7-bit quantities, without specifying a particular transfer encoding in terms of bytes. In the past, it was used in different ways, e.g. so that five ASCII characters were packed into one 36-bit storage unit, or so that 8-bit bytes used the extra bit for checking purposes (a parity bit) or for transfer control. But nowadays ASCII is used so that one ASCII character is encoded as one 8-bit byte with the most significant bit set to zero. This is the de facto standard encoding scheme and is implied in a large number of specifications, but strictly speaking it is not part of the ASCII standard.
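The historical packing mentioned above works because 5 × 7 = 35 bits fit in a 36-bit word with one bit to spare. A sketch of that arithmetic, alongside the modern one-byte convention (Python for illustration; `pack36` is a hypothetical helper, not a real API):

```python
def pack36(five_chars: str) -> int:
    """Pack five 7-bit ASCII characters into one 36-bit integer word."""
    assert len(five_chars) == 5
    word = 0
    for ch in five_chars:
        code = ord(ch)
        assert code < 128           # must be 7-bit ASCII
        word = (word << 7) | code   # shift in 7 bits per character
    return word                     # uses at most 35 of the 36 bits

word = pack36("HELLO")
print(hex(word), word < 2 ** 36)    # fits in a 36-bit word

# The modern convention: every ASCII byte has its most significant bit clear.
print(all(b < 0x80 for b in "HELLO".encode("ascii")))   # True
```

The high bit being always zero in the modern convention is precisely what UTF-8 later exploited: bytes with the high bit set are free to carry multi-byte sequences.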

Jukka K. Korpela
  • 9
    So ASCII is the same as UTF-7? – aaiezza Dec 30 '16 at 16:56
  • 1
    the reason is that when MS introduced Unicode support, UTF-8 didn't exist and UCS-2 was the only encoding. Therefore when Unicode 2.0 was released, the only way forward for them was moving to UTF-16 – phuclv Sep 08 '18 at 02:04
1

Unicode and ASCII each define codepoints plus an encoding scheme.

Unicode (via UTF-8) is a superset of ASCII, since UTF-8 is backwards compatible with ASCII.

Conversion and representation (in binary/hexadecimal) of a string:

  • String := a sequence of graphemes (a "character" is, loosely, a subset of this notion).
  • The sequence of graphemes (characters) is mapped to codepoints by the character set.
  • Codepoints are then encoded (converted) to binary/hex using an encoding scheme: for graphemes, UTF-8/UTF-32 (i.e. the Unicode encodings); for plain characters, ASCII.
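The grapheme/codepoint distinction in the list above can be made concrete (Python sketch, for illustration only):

```python
import unicodedata

# One grapheme, two codepoints: 'e' followed by COMBINING ACUTE ACCENT.
s = "e\u0301"
print(len(s))                      # 2 — two codepoints
print(s.encode("utf-8"))           # b'e\xcc\x81' — three bytes in UTF-8

# NFC normalization composes them into the single codepoint U+00E9 ('é').
nfc = unicodedata.normalize("NFC", s)
print(len(nfc), hex(ord(nfc)))     # 1 0xe9
```

So a user-perceived "character" (grapheme) may span several codepoints, and each codepoint may span several bytes, depending on the encoding scheme.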

Unicode (via UTF-8) supports 1,112,064 valid codepoints (covering most graphemes from the world's languages).

ASCII supports 128 codepoints (mostly English).
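The figure 1,112,064 comes from the full codepoint range minus the surrogates reserved for UTF-16; a quick check of the arithmetic (Python, for illustration):

```python
# Codepoints span U+0000..U+10FFFF (1,114,112 values), but the 2,048
# surrogates U+D800..U+DFFF are reserved for UTF-16 pairs and are not
# valid scalar values on their own.
total = 0x10FFFF + 1
surrogates = 0xDFFF - 0xD800 + 1
print(total - surrogates)          # 1112064 — encodable scalar values
print(2 ** 7)                      # 128 — ASCII's codepoint count (7 bits)
```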

mation4in