It seems a database I am working on had non-printing characters that were messing something up down the line. After some digging, the computer shows them as â, then U+0080, then U+0093.

Any idea what these characters could mean? I suspect it's something from Unicode that wasn't converted correctly, but I don't know how to translate it.

needoriginalname
  • The U+ notation indicates a Unicode code point (independent of encoding), not bytes (which depend on the encoding). Try not to mix the two: the topic is already complex, and mixing them makes it much less manageable. – Giacomo Catenazzi Jan 23 '19 at 13:42
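
To illustrate the distinction the comment draws, here is a small sketch in Python. The variable names are only illustrative; the point is that one code point (â, U+00E2) turns into different bytes depending on the encoding chosen.

```python
# One character, one code point, but different bytes per encoding.
ch = "\u00e2"  # the character â

code_point = ord(ch)               # the Unicode code point, independent of encoding
utf8_bytes = ch.encode("utf-8")    # two bytes in UTF-8
latin1_bytes = ch.encode("latin1") # a single byte in ISO-8859-1

print(f"U+{code_point:04X}")  # U+00E2
print(utf8_bytes)             # b'\xc3\xa2'
print(latin1_bytes)           # b'\xe2'
```

So "U+0080" names a character, while "0x80" names a byte, and the two only coincide under particular encodings such as ISO-8859-1.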

2 Answers


The Unicode code point for â is U+00E2. E2 80 93 is the UTF-8 byte sequence for a dash, specifically U+2013 EN DASH. U+0080 and U+0093 are C1 control characters, which is why they show up as non-printing.

If UTF-8-encoded data is incorrectly decoded as ISO-8859-1 (also called "latin1") it is displayed as you describe. Here's an example in Python:

>>> print('\u2013')  # Displays U+2013 EN DASH
–
>>> '\u2013'.encode('utf8') # byte sequence of UTF-8-encoded EN DASH
b'\xe2\x80\x93'
>>> '\u2013'.encode('utf8').decode('latin1')  # decoded incorrectly
'â\x80\x93'
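
Going the other way may be what is needed to clean up the database values. The following is a minimal sketch, assuming the damage really is UTF-8 bytes that were decoded as ISO-8859-1; `repair_mojibake` is a hypothetical helper name, not something from a library. If the text had instead been decoded as Windows-1252, the middle characters would appear as € and “ and a different codec would be needed.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        # Recover the original bytes, then decode them with the intended encoding.
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards: leave it alone.
        return text


damaged = "10\u00e2\u0080\u009320"  # 'â', U+0080, U+0093 between the digits
print(repair_mojibake(damaged))     # 10–20
```

Text that was never damaged usually fails one of the two steps and is returned as-is, but it is worth checking a sample of rows by hand before applying something like this in bulk.
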
Mark Tolonen

Found a website that described it for me. https://www.compart.com/en/unicode/U+2012#UNC_DB

The byte values I was seeing matched the ones listed under the UTF-8 Encoding entry on that page.

needoriginalname