Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It is intended to support all the characters required for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
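
As a small illustration of the mapping above, Python's built-in `ord()` and `chr()` convert between a character and its code point number. The helper name `code_point_label` is made up for this sketch and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; format it as
    # at least four uppercase hexadecimal digits.
    return f"U+{ord(ch):04X}"


# \u039B and \u039C are the Greek capital letters lambda and mu.
for ch in "ABC\u039B\u039C":
    print(code_point_label(ch), ch)

# chr() goes the other way: from a code point number to a character.
assert chr(0x0041) == "A"
assert chr(0x039B) == "\u039B"
```
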

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
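
The byte sequences in the table can be reproduced with Python's `str.encode()`. This is only a sketch: the helpers `hex_bytes` and `encodings` are hypothetical names, and U+1F600 is added here as an example of a code point outside the Basic Multilingual Plane, which takes four bytes in both encodings.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encodings(ch: str) -> tuple:
    """Return the (UTF-8, UTF-16 big-endian) byte strings for one character."""
    return hex_bytes(ch.encode("utf-8")), hex_bytes(ch.encode("utf-16-be"))


# A (U+0041), Greek capital lambda (U+039B), and an emoji (U+1F600).
for ch in "A\u039B\U0001F600":
    utf8, utf16 = encodings(ch)
    print(f"U+{ord(ch):04X}  {utf8:<12} {utf16}")
```

Note that `"utf-16-be"` is used rather than `"utf-16"`, because the latter prepends a byte order mark to the output.
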

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
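
Two of these operations, normalization and case handling, are available in Python's standard library. The sketch below assumes the standard `unicodedata` module; `same_text` is a hypothetical helper name. Locale-aware sorting is not shown, since the standard library does not implement the Unicode Collation Algorithm directly.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00E9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # plain e followed by a combining acute accent

# The two strings render the same but are different code point sequences.
assert composed != decomposed
assert same_text(composed, decomposed)

# Case rules are not always one-to-one: the German sharp s
# uppercases to two letters and case-folds to "ss".
assert "\u00DF".upper() == "SS"
assert "Stra\u00DFe".casefold() == "strasse"
```
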

For more general information, see the Unicode article on Wikipedia.

24916 questions
131 votes, 7 answers

Why does ENcoding a string result in a DEcoding error (UnicodeDecodeError)?

I'm really confused. I tried to encode but the error said can't decode... >>> "你好".encode("utf8") Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0:…
thoslin
130 votes, 8 answers

How can I remove non-ASCII characters but leave periods and spaces?

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code: def onlyascii(char): if ord(char) < 48 or…
user1120342
129 votes, 13 answers

Creating Unicode character from its number

I want to display a Unicode character in Java. If I do this, it works just fine: String symbol = "\u2202"; symbol is equal to "∂". That's what I want. The problem is that I know the Unicode number and need to create the Unicode symbol from that. …
Paul Reiners
128 votes, 3 answers

How does UTF-8 "variable-width encoding" work?

The Unicode standard has enough code points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width…
dsimard
126 votes, 7 answers

Difference between open and codecs.open in Python

There are two ways to open a text file in Python: f = open(filename) And import codecs f = codecs.open(filename, encoding="utf-8") When is codecs.open preferable to open?
BlogueroConnor
124 votes, 3 answers

What are the most common non-BMP Unicode characters in actual use?

In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be…
hippietrail
123 votes, 5 answers

How to make unicode string with python3

I used this: u = unicode(text, 'utf-8') But I'm getting an error with Python 3 (or maybe I just forgot to include something): NameError: global name 'unicode' is not defined. Thank you.
cnd
121 votes, 8 answers

How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences? I found a similar question here, but it doesn't seem to work.
Docstero
120 votes, 7 answers

Is there an HTML entity for an info icon?

I am looking for a basic information icon like this:
Alexcamostyle
119 votes, 12 answers

UnicodeEncodeError: 'latin-1' codec can't encode character

What could be causing this error when I try to insert a foreign character into the database? >>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256) And how do I resolve it? Thanks!
ensnare
119 votes, 9 answers

Character reading from file in Python

In a text file, there is a string "I don't like this". However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I use f1 = open (file1, "r") text =…
Graviton
119 votes, 6 answers

How to put a unicode character in XAML?

I'm trying to do this: To get a — to appear in front of the text. It…
Alex Baranosky
119 votes, 8 answers

What's the complete range for Chinese characters in Unicode?

Unicode allocated U+4E00..U+9FFF for Chinese characters. This is part of the complete set, but not all.
omg
117 votes, 12 answers

HTML for the Pause symbol in audio and video control

I'm trying to find the Unicode pause symbol so that I can display it on a button. I was able to find that the Unicode play symbol is ►, but I'm looking for the equivalent pause symbol.
user3081307
116 votes, 5 answers

What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C. Some interesting examples: The heart character. If I type this into my browser: http://www.google.com/search?q=♥ Then…
Josh Gibson