Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

U+0041 A
U+0042 B
U+0043 C
...
U+039B Λ
U+039C Μ
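
The mapping above can be explored from code. The following is a minimal Python 3 sketch, not part of the original tag wiki. The helper name `describe` is hypothetical; `ord`, `chr` and `unicodedata.name` are standard library calls.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point of a single character as 'U+XXXX' plus its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# Walk the same characters listed above: A, B, C, then two Greek capitals.
for ch in "ABC\u039B\u039C":
    print(describe(ch), ch)

# chr() goes the other way, from a code point number back to the character.
print(chr(0x0041))
```

`ord` and `chr` work on code points, independent of any byte encoding.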

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common forms are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
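
The byte values in the table can be reproduced with a short Python 3 sketch. The helper `hex_bytes` is a hypothetical name used for illustration, and the third sample (U+1F600) is an assumption added here to show the four-byte case; it does not appear in the table.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# "A" and U+039B come from the table; U+1F600 is an extra sample
# outside the Basic Multilingual Plane (four bytes in both encodings).
samples = ["A", "\u039B", "\U0001F600"]

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        hex_bytes(ch, "utf-8"),
        hex_bytes(ch, "utf-16-be"),
        sep="\t",
    )
```

The `utf-16-be` codec is used so that the output matches the big-endian column and carries no byte order mark.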

UTF FAQ, UTF-16 FAQ, UTF-8 FAQ

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
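
As an illustration of two of these operations, here is a minimal Python 3 sketch of normalization and case handling using only the standard library. The sample strings are assumptions chosen for this example. Locale-aware sorting is left out because the standard library does not implement the Unicode Collation Algorithm.

```python
import unicodedata

# The same visible character "é", written two ways.
composed = "\u00e9"     # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # two code points: "e" + COMBINING ACUTE ACCENT

# A plain comparison sees different code point sequences.
print(composed == decomposed)

# Normalizing both sides to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)

# Case rules are not always one-to-one: "ß" uppercases to two letters,
# and casefold() is meant for caseless comparison.
print("ß".upper())
print("Straße".casefold() == "STRASSE".casefold())
```

Normalizing to one form (NFC is a common choice) before comparing or storing strings avoids mismatches between visually identical text.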

Latest Version of the Standard

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions

167

votes

9 answers

Python string prints as [u'String']

This will surely be an easy one but it is really bugging me. I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents. All of the text that…

python unicode ascii

asked Mar 01 '09 at 10:48

gnuchu

2,079
3
16
8

164

votes

3 answers

Python: Using .format() on a Unicode-escaped string

I am using Python 2.6.5. My code requires the use of the "more than or equal to" sign. Here it goes: >>> s = u'\u2265' >>> print s >>> ≥ >>> print "{0}".format(s) Traceback (most recent call last): File "", line 1, in …

python string unicode python-2.x

asked Jul 13 '10 at 08:29

Kit

30,365
39
105
149

156

votes

7 answers

Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal? Unicode characters are forbidden as per the RFC on URLs. They would have to be percent encoded to be standards compliant. My main point, though, is serving…

html url unicode utf-8

asked Apr 30 '10 at 07:07

Pekka

442,112
142
972
1,088

154

votes

4 answers

How to use Greek symbols in ggplot2?

My categories need to be named with Greek letters. I am using ggplot2, and it works beautifully with the data. Unfortunately I cannot figure out how to put those Greek symbols on the x axis (at the tick marks) and also make them appear in the…

r graphics unicode utf-8 ggplot2

asked Mar 14 '11 at 01:02

Sam

7,922
16
47
62

151

votes

5 answers

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16? Why do we need these? MessageDigest md = MessageDigest.getInstance("SHA-256"); String text = "This is some text"; md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed byte[] digest =…

java unicode utf-8 utf-16 utf

asked Jan 11 '11 at 07:38

theJava

14,620
45
131
172

151

votes

8 answers

What's HTML character code 8203?

What does the character code (HTML) &#8203; mean? I found it in one of my jQuery scripts and wondered what it was. Thanks. Edit: Here is the script it was in (it was added to the end, found it in Firebug)

html unicode character-codes

asked Jun 04 '10 at 11:35

Kyle

65,599
28
144
152

149

votes

8 answers

Why is the length of this string longer than the number of characters in it?

This code: string a = "abc"; string b = "A𠈓C"; Console.WriteLine("Length a = {0}", a.Length); Console.WriteLine("Length b = {0}", b.Length); outputs: Length a = 3 Length b = 4 Why? The only thing I could imagine is that the Chinese character is 2…

c# .net string unicode unicode-string

asked Nov 17 '14 at 15:13

weini37

1,455
3
10
9

146

votes

4 answers

How can I add white space before an element's content using CSS?

None of the following code works: p:before { content: "&nbsp;"; } p:before { content: "&#160;"; } How do I add white space before an element's content? Note: I need to color the border-left and the margin-left for semantic use and use the space as a…

css unicode space css-content

asked May 14 '13 at 20:36

Hugolpz

17,296
26
100
187

145

votes

7 answers

What is normalized UTF-8 all about?

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching. However, I'm trying to figure out what this means for applications. For example, in…

php c unicode unicode-normalization

asked Oct 28 '11 at 15:14

Xeoncross

55,620
80
262
364

145

votes

9 answers

complete, monospaced Unicode font?

I'm looking for a good programming font that lets me add comments and string literals in Unicode, usually Japanese and Chinese along with some Latin and Cyrillic languages. So far the situation seems to be "complete, monospace, free, pick 2" and…

unicode fonts text-editor

asked Feb 25 '09 at 15:35

nachik

143

votes

6 answers

Why does Python print unicode characters when the default encoding is ASCII?

From the Python 2.6 shell: >>> import sys >>> print sys.getdefaultencoding() ascii >>> print u'\xe9' é >>> I expected to have either some gibberish or an Error after the print statement, since the "é" character isn't part of ASCII and I haven't…

python unicode encoding ascii python-2.x

asked Apr 08 '10 at 00:03

Michael Ekoka

19,050
12
78
79

141

votes

12 answers

Converting Symbols, Accent Letters to English Alphabet

The problem is that, as you know, there are thousands of characters in the Unicode chart and I want to convert all the similar characters to the letters which are in English alphabet. For instance here are a few…

java unicode special-characters diacritics

asked Jun 17 '09 at 18:31

ahmet alp balkan

42,679
38
138
214

137

votes

6 answers

Java FileReader encoding issue

I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all. Here's my environment: Windows 2003, OS encoding: CP1252 Java 5.0 My files are UTF-8…

java file unicode encoding

asked Mar 30 '09 at 09:55

nybon

8,894
9
59
67

134

votes

3 answers

Unicode equivalents for \w and \b in Java regular expressions?

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig. Unfortunately,…

java regex unicode character-properties

asked Nov 29 '10 at 15:00

Tim Pietzcker

328,213
58
503
561

132

votes

6 answers

How can I output UTF-8 from Perl?

I am trying to write a Perl script using the utf8 pragma, and I'm getting unexpected results. I'm using Mac OS X 10.5 (Leopard), and I'm editing with TextMate. All of my settings for both my editor and operating system are defaulted to writing…

perl unicode utf-8

asked Mar 09 '09 at 19:30

Peter Conrey
