13

Why does the degree symbol differ between UTF-8 and Unicode?

According to http://www.utf8-chartable.de/ and http://www.fileformat.info/info/unicode/char/b0/index.htm, the Unicode code point is B0, but the UTF-8 encoding is C2 B0. How come?

Peter Mortensen
Muhammad Hewedy
  • There are thousands of characters whose representation differs between UTF-8 and UTF-16. What makes you believe that the degree symbol deserves special treatment? – Mike Nakis Jan 04 '12 at 18:38
  • 5
    You need to understand the difference between Unicode and its various encodings. Read the links people have posted. – tripleee Jan 04 '12 at 18:43
  • 1
    @MikeNakis: I believe that *all* Unicode code points have different representations in UTF-8 and UTF-16. – Keith Thompson Mar 07 '13 at 18:25
  • Do you mean *"Why does the 'degree' symbol differ between UTF-8 and the [Unicode code point](https://en.wikipedia.org/wiki/Code_point)?"*? – Peter Mortensen Apr 27 '23 at 18:04

4 Answers

28

UTF-8 is a way to encode Unicode characters using a variable number of bytes (the number of bytes depends on the code point).

Code points between U+0080 and U+07FF use the following 2-byte encoding:

110xxxxx 10xxxxxx

where the x's are the bits of the code point being encoded.

Let's consider U+00B0. In binary, 0xB0 is 10110000; zero-padded to the 11 bits the template holds, that is 000 1011 0000. Substituting those bits into the template gives:

 11000010 10110000

In hex, this is 0xC2 0xB0.
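
For anyone who wants to check this by hand, here is a minimal sketch (Python 3.8+ assumed; the question itself is language-agnostic) that applies the template above and compares the result with the built-in encoder:

    # Build the two UTF-8 bytes of U+00B0 from the 110xxxxx 10xxxxxx template.
    cp = 0x00B0                       # code point of the degree sign
    byte1 = 0b11000000 | (cp >> 6)    # top 5 of the 11 bits -> 110xxxxx
    byte2 = 0b10000000 | (cp & 0x3F)  # low 6 bits           -> 10xxxxxx

    print(hex(byte1), hex(byte2))          # 0xc2 0xb0
    print("°".encode("utf-8").hex(" "))    # c2 b0 -- same bytes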

NPE
7

UTF-8 is one encoding of Unicode. UTF-16 and UTF-32 are other encodings of Unicode.

Unicode defines a numeric value for each character; the degree symbol happens to be 0xB0, or 176 in decimal. Unicode does not define how those numeric values are represented.

UTF-8 encodes the value 0xB0 as two consecutive octets (bytes) with values 0xC2 0xB0.

UTF-16 encodes the same value either as 0x00 0xB0 or as 0xB0 0x00, depending on endianness.

UTF-32 encodes it as 0x00 0x00 0x00 0xB0 or as 0xB0 0x00 0x00 0x00, again depending on endianness (I suppose other orderings are possible).
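
To see all of these side by side, here is a minimal sketch (Python 3.8+ assumed, not part of the original answer) that prints the degree sign's bytes in each encoding; the -be/-le codec names pick the byte order explicitly, so no byte-order mark is emitted:

    ch = "\u00b0"  # degree sign, code point U+00B0
    for codec in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(f"{codec:10} -> {ch.encode(codec).hex(' ')}")
    # utf-8      -> c2 b0
    # utf-16-be  -> 00 b0
    # utf-16-le  -> b0 00
    # utf-32-be  -> 00 00 00 b0
    # utf-32-le  -> b0 00 00 00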

Keith Thompson
5

Unicode assigns the code point 0x00B0 to that character (and UTF-16 and UTF-32 simply store that value). UTF-8 can't represent values above 127 (0x7F) in a single byte, because the high bit of each byte is reserved to signal that the character is part of a multi-byte sequence.

Basic 7-bit ASCII maps directly to the first 128 characters of UTF-8. Any character whose value is above 127 decimal (0x7F) must be "escaped" by setting the high bit and adding one or more extra bytes to describe it.
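
To make that concrete, here is a minimal sketch (Python 3.8+ assumed) showing that 7-bit ASCII characters keep their single byte in UTF-8, while higher code points grow to two or more bytes with the high bit set:

    for ch in ("A", "°", "€"):   # U+0041, U+00B0, U+20AC
        data = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {data.hex(' ')} ({len(data)} byte(s))")
    # U+0041 -> 41 (1 byte(s))
    # U+00B0 -> c2 b0 (2 byte(s))
    # U+20AC -> e2 82 ac (3 byte(s))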

Marc B
1

The answers from NPE, Marc, and Keith are good and above my knowledge of this topic. Still, I had to read them a couple of times before I realized what this was about. Then I saw this web page that made it "click" for me.

At http://www.utf8-chartable.de/, you can see the following:

UTF-8 needs C2 80 to represent U+0080

Notice how it is necessary to use TWO bytes to encode ONE character. Now read the accepted answer from NPE.
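
And a quick check of that chart entry (Python 3.8+ assumed), showing that U+0080 is the first code point that needs a second UTF-8 byte:

    print("\u0080".encode("utf-8").hex(" "))   # c2 80
    print("\u007f".encode("utf-8").hex(" "))   # 7f -- one byte below the boundary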

Tormod