Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

U+0041 A
U+0042 B
U+0043 C
...
U+039B Λ
U+039C Μ
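As a rough illustration of the mapping above, here is a minimal Python 3 sketch that converts between characters and code points with the built-in `ord` and `chr`. The helper name `code_point_label` is made up for this example and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; format it as at least 4 hex digits.
    return f"U+{ord(ch):04X}"


# The Greek letters are written as escapes to avoid look-alike Latin letters.
for ch in "ABC\u039b\u039c":
    print(code_point_label(ch), ch)

# chr() goes the other way: from a code point number to the character.
lam = chr(0x039B)
```

The loop prints one `U+XXXX` label per character, in the same shape as the list above.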

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
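The rows in the table can be reproduced with a short Python 3 sketch, assuming Python 3.8 or later for `bytes.hex` with a separator. The helper `hex_bytes` is a hypothetical name used only here; `"utf-8"` and `"utf-16-be"` are standard Python codec names.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and show the resulting bytes as spaced, upper-case hex."""
    return text.encode(encoding).hex(" ").upper()


# Same code points as the table, plus one outside the BMP (U+1F600).
for ch in ["A", "\u039b", "\U0001F600"]:
    label = f"U+{ord(ch):04X}"
    print(label, hex_bytes(ch, "utf-8"), "|", hex_bytes(ch, "utf-16-be"))
```

The last character is included to show the four-byte case: it takes four bytes in UTF-8 and a surrogate pair (two 16-bit units) in UTF-16.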

UTF FAQ, UTF-16 FAQ, UTF-8 FAQ

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
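To give a feel for two of these operations, the following Python 3 sketch uses the standard `unicodedata` module for normalization and the built-in string methods for case mapping. It is only an illustration: the helper `nfc` is a made-up name, and locale-aware sorting (collation) is not covered here because it usually needs locale data or an ICU binding.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text to NFC (canonical composition)."""
    return unicodedata.normalize("NFC", text)


composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings render the same but compare unequal until normalized.
print(composed == decomposed)
print(nfc(composed) == nfc(decomposed))

# Case mapping is not always one-to-one: the sharp s expands when upper-cased.
sharp_s = "\u00df"
print(sharp_s.upper(), "Stra\u00dfe".casefold())
```

Normalizing both sides before comparing is a common way to avoid mismatches between visually identical strings.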

Latest Version of the Standard

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions

131 votes, 7 answers

Why does ENcoding a string result in a DEcoding error (UnicodeDecodeError)?

I'm really confused. I tried to encode but the error said can't decode... >>> "你好".encode("utf8") Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0:…

python python-2.7 unicode python-2.x python-unicode

asked Mar 10 '12 at 05:10

thoslin

130 votes, 8 answers

How can I remove non-ASCII characters but leave periods and spaces?

I'm working with a .txt file. I want a string of the text from the file with no non-ASCII characters. However, I want to leave spaces and periods. At present, I'm stripping those too. Here's the code: def onlyascii(char): if ord(char) < 48 or…

python text unicode filter ascii

asked Dec 31 '11 at 18:23

user1120342

129 votes, 13 answers

Creating Unicode character from its number

I want to display a Unicode character in Java. If I do this, it works just fine: String symbol = "\u2202"; symbol is equal to "∂". That's what I want. The problem is that I know the Unicode number and need to create the Unicode symbol from that. …

java string unicode character

asked Apr 07 '11 at 18:40

Paul Reiners

128 votes, 3 answers

How does UTF-8 "variable-width encoding" work?

The unicode standard has enough code-points in it that you need 4 bytes to store them all. That's what the UTF-32 encoding does. Yet the UTF-8 encoding somehow squeezes these into much smaller spaces by using something called "variable-width…

unicode utf-8 character-encoding multibyte

asked Oct 09 '09 at 13:02

dsimard

126 votes, 7 answers

Difference between open and codecs.open in Python

There are two ways to open a text file in Python: f = open(filename) And import codecs f = codecs.open(filename, encoding="utf-8") When is codecs.open preferable to open?

python unicode codec

asked Mar 09 '11 at 18:56

BlogueroConnor

124 votes, 3 answers

What are the most common non-BMP Unicode characters in actual use?

In your experience which Unicode characters, codepoints, ranges outside the BMP (Basic Multilingual Plane) are the most common so far? These are the ones which require 4 bytes in UTF-8 or surrogates in UTF-16. I would've expected the answer to be…

unicode cjk codepoint surrogate-pairs astral-plane

asked Apr 06 '11 at 13:36

hippietrail

123 votes, 5 answers

How to make unicode string with python3

I used this : u = unicode(text, 'utf-8') But getting error with Python 3 (or... maybe I just forgot to include something) : NameError: global name 'unicode' is not defined Thank you.

python unicode python-3.x

asked Jul 25 '11 at 05:16

cnd

121 votes, 8 answers

How to decode Unicode escape sequences like "\u00ed" to proper UTF-8 encoded characters?

Is there a function in PHP that can decode Unicode escape sequences like "\u00ed" to "í" and all other similar occurrences? I found a similar question here but it doesn't seem to work.

php unicode utf-8 escaping decoding

asked May 29 '10 at 09:53

Docstero

120 votes, 7 answers

Is there an HTML entity for an info icon?

I am looking for a basic information icon like this:

html unicode html-entities

asked Nov 23 '15 at 18:58

Alexcamostyle

119 votes, 12 answers

UnicodeEncodeError: 'latin-1' codec can't encode character

What could be causing this error when I try to insert a foreign character into the database? >>UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201c' in position 0: ordinal not in range(256) And how do I resolve it? Thanks!

python mysql unicode pylons

asked Oct 15 '10 at 13:57

ensnare

119 votes, 9 answers

Character reading from file in Python

In a text file, there is a string "I don't like this". However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I use f1 = open (file1, "r") text =…

python unicode encoding ascii

asked Sep 29 '08 at 06:47

Graviton

119 votes, 6 answers

How to put a unicode character in XAML?

I'm trying to do this: To get a — to appear in front of the text. It…

wpf xaml unicode binding

asked Sep 02 '09 at 11:43

Alex Baranosky

119 votes, 8 answers

What's the complete range for Chinese characters in Unicode?

Unicode allocated U+4E00..U+9FFF for Chinese characters. This is part of the complete set, but not all.

unicode cjk

asked Sep 02 '09 at 06:13

omg

117 votes, 12 answers

HTML for the Pause symbol in audio and video control

I'm trying to find the Unicode symbol to make a button display the Unicode pause symbol. I was able to find that the Unicode play symbol is ► but I'm looking for the equivalent of the Unicode pause symbol.

html unicode special-characters symbols

asked Apr 05 '14 at 19:28

user3081307

116 votes, 5 answers

What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C. Some interesting examples: The heart character. If I type this into my browser: http://www.google.com/search?q=♥ Then…

unicode utf-8 character-encoding urlencode web-standards

asked May 26 '09 at 21:18

Josh Gibson
