I have found two non-printing characters in a database, what do they mean?

Question

It seems a database I am working on had two non-printing characters that were messing something up down the line. After some digging, I found that the computer shows them as â, followed by U+0080 and then U+0093.

Any idea what these characters could mean? I suspect it's something from Unicode that wasn't converted correctly, but I don't know how to translate it.

The U+ notation indicates a Unicode code point (independent of encoding), not bytes (which depend on the encoding). Try not to mix the two: the topic is already complex, and mixing them makes it much less manageable. — Giacomo Catenazzi, Jan 23 '19 at 13:42

score 1 · Accepted Answer · answered Jan 23 '19 at 08:06

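The Unicode code point for â is U+00E2. E2 80 93 is the UTF-8 byte sequence for a dash, specifically U+2013 EN DASH. Read one byte at a time as Latin-1, E2 becomes â, while 80 and 93 become the non-printing C1 control characters U+0080 and U+0093.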

If UTF-8-encoded data is incorrectly decoded as ISO-8859-1 (also called "latin1"), it is displayed as you describe. Here's an example in Python:

>>> print('\u2013')  # Displays U+2013 EN DASH
–
>>> '\u2013'.encode('utf8') # byte sequence of UTF-8-encoded EN DASH
b'\xe2\x80\x93'
>>> '\u2013'.encode('utf8').decode('latin1')  # decoded incorrectly
'â\x80\x93'
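
Going the other way, text that has already been mis-decoded can often be repaired by reversing the two steps. The following is a minimal sketch, assuming the damage came from exactly one UTF-8-read-as-Latin-1 mix-up; `fix_mojibake` is a hypothetical helper name, not a library function.

```python
def fix_mojibake(text: str) -> str:
    """Undo a single UTF-8-read-as-Latin-1 mistake (hypothetical helper)."""
    try:
        # Re-encode as Latin-1 to recover the original bytes,
        # then decode those bytes as the UTF-8 they really were.
        return text.encode('latin1').decode('utf8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit this pattern of damage; return it unchanged.
        return text


broken = '\u00e2\x80\x93'  # â followed by U+0080 and U+0093, as in the question
fixed = fix_mojibake(broken)
print(f'U+{ord(fixed):04X}')  # U+2013
```

If the broken text shows up as â€“ rather than â plus two invisible characters, the wrong codec was probably Windows-1252, and `'cp1252'` would take the place of `'latin1'` in the sketch.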

score 0 · Answer 2 · answered Jan 22 '19 at 21:27

Found a website that described it for me. https://www.compart.com/en/unicode/U+2012#UNC_DB

The numbers matched what appears in the page's UTF-8 encoding entry.
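
A similar lookup can be done locally. This is a small sketch using Python's standard `unicodedata` module; `describe_utf8` is a hypothetical helper name, and it assumes the input bytes are valid UTF-8.

```python
import unicodedata


def describe_utf8(raw: bytes) -> list[str]:
    """Decode UTF-8 bytes and describe each character (hypothetical helper)."""
    return [
        # Code point in U+XXXX form, followed by the official character name.
        f'U+{ord(ch):04X} {unicodedata.name(ch, "<unnamed>")}'
        for ch in raw.decode('utf8')
    ]


print(describe_utf8(bytes.fromhex('E2 80 93')))  # ['U+2013 EN DASH']
```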

needoriginalname
