Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It is intended to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
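As a rough illustration of the list above, the following Python 3 sketch maps characters to their code points and back using only the standard library. The helper name `describe` is made up for this example; `ord`, `chr` and `unicodedata.name` are standard.

```python
import unicodedata

def describe(ch):
    """Return (code point in U+XXXX notation, official Unicode character name)."""
    return "U+{:04X}".format(ord(ch)), unicodedata.name(ch)

# The same characters as in the list above; Λ and Μ are written as escapes.
for ch in "ABC\u039b\u039c":
    print(describe(ch), ch)

# chr() goes the other way: from a code point to the character.
print(chr(0x0041))  # A
```

Note that a code point is only a number identifying a character; how it is stored as bytes is a separate matter.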

Unicode Transformation Formats

UTFs describe how code points are encoded as bytes. The most common forms are UTF-8, which encodes each code point as one to four bytes, and UTF-16, which encodes each code point as two bytes or, for code points above U+FFFF, as four bytes (a surrogate pair).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
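The rows of the table above can be reproduced with Python 3's built-in codecs. This is only a sketch: `hex_bytes` is a hypothetical helper, and U+1F600 is an extra character, not in the table, added to show the four-byte case.

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join("{:02X}".format(b) for b in text.encode(encoding))

# "utf-16-be" is big-endian UTF-16 without a byte order mark.
for ch in ["A", "\u039b", "\U0001F600"]:
    print("U+{:04X}".format(ord(ch)),
          hex_bytes(ch, "utf-8"),
          hex_bytes(ch, "utf-16-be"))
```

Using plain `"utf-16"` instead of `"utf-16-be"` would prepend a byte order mark, so the output would no longer match the table.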

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
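Normalization and case mapping can be tried out with Python 3's standard library. The sketch below is illustrative only: the sample strings and the helper `same_text` are chosen for this example, and it does not cover locale-sensitive collation, which needs a library such as ICU.

```python
import unicodedata

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: canonically equivalent

# casefold() applies Unicode case folding for caseless comparison.
print("Stra\u00dfe".casefold() == "STRASSE".casefold())  # True
```

NFC is used here as one reasonable choice; NFD, NFKC and NFKD are the other normalization forms and suit different comparisons.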

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions
167
votes
9 answers

Python string prints as [u'String']

This will surely be an easy one but it is really bugging me. I have a script that reads in a webpage and uses Beautiful Soup to parse it. From the soup I extract all the links as my final goal is to print out the link.contents. All of the text that…
gnuchu
  • 2,079
  • 3
  • 16
  • 8
164
votes
3 answers

Python: Using .format() on a Unicode-escaped string

I am using Python 2.6.5. My code requires the use of the "more than or equal to" sign. Here it goes: >>> s = u'\u2265' >>> print s >>> ≥ >>> print "{0}".format(s) Traceback (most recent call last): File "", line 1, in
Kit
  • 30,365
  • 39
  • 105
  • 149
156
votes
7 answers

Unicode characters in URLs

In 2010, would you serve URLs containing UTF-8 characters in a large web portal? Unicode characters are forbidden as per the RFC on URLs (see here). They would have to be percent encoded to be standards compliant. My main point, though, is serving…
Pekka
  • 442,112
  • 142
  • 972
  • 1,088
154
votes
4 answers

How to use Greek symbols in ggplot2?

My categories need to be named with Greek letters. I am using ggplot2, and it works beautifully with the data. Unfortunately I cannot figure out how to put those greek symbols on the x axis (at the tick marks) and also make them appear in the…
Sam
  • 7,922
  • 16
  • 47
  • 62
151
votes
5 answers

Difference between UTF-8 and UTF-16?

Difference between UTF-8 and UTF-16? Why do we need these? MessageDigest md = MessageDigest.getInstance("SHA-256"); String text = "This is some text"; md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed byte[] digest =…
theJava
  • 14,620
  • 45
  • 131
  • 172
151
votes
8 answers

What's HTML character code 8203?

What does the character code (HTML) &#8203; mean? I found it in one of my jQuery scripts and wondered what it was. Thanks. Edit: Here is the script it was in (it was added to the end, found it in Firebug)
Kyle
  • 65,599
  • 28
  • 144
  • 152
149
votes
8 answers

Why is the length of this string longer than the number of characters in it?

This code: string a = "abc"; string b = "AC"; Console.WriteLine("Length a = {0}", a.Length); Console.WriteLine("Length b = {0}", b.Length); outputs: Length a = 3 Length b = 4 Why? The only thing I could imagine is that the Chinese character is 2…
weini37
  • 1,455
  • 3
  • 10
  • 9
146
votes
4 answers

How can I add white space before an element's content using CSS?

None of the following code works: p:before { content: " "; } p:before { content: " "; } How do I add white space before an element's content? Note: I need to color the border-left and the margin-left for semantic use and use the space as a…
Hugolpz
  • 17,296
  • 26
  • 100
  • 187
145
votes
7 answers

What is normalized UTF-8 all about?

The ICU project (which also now has a PHP library) contains the classes needed to help normalize UTF-8 strings to make it easier to compare values when searching. However, I'm trying to figure out what this means for applications. For example, in…
Xeoncross
  • 55,620
  • 80
  • 262
  • 364
145
votes
9 answers

complete, monospaced Unicode font?

I'm looking for a good programming font that lets me add comments and string literals in Unicode, usually Japanese and Chinese along with some Latin and Cyrillic languages. So far the situation seems to be "complete, monospace, free, pick 2" and…
nachik
  • 754
  • 3
  • 9
  • 11
143
votes
6 answers

Why does Python print unicode characters when the default encoding is ASCII?

From the Python 2.6 shell: >>> import sys >>> print sys.getdefaultencoding() ascii >>> print u'\xe9' é >>> I expected to have either some gibberish or an Error after the print statement, since the "é" character isn't part of ASCII and I haven't…
Michael Ekoka
  • 19,050
  • 12
  • 78
  • 79
141
votes
12 answers

Converting Symbols, Accent Letters to English Alphabet

The problem is that, as you know, there are thousands of characters in the Unicode chart and I want to convert all the similar characters to the letters which are in English alphabet. For instance here are a few…
ahmet alp balkan
  • 42,679
  • 38
  • 138
  • 214
137
votes
6 answers

Java FileReader encoding issue

I tried to use java.io.FileReader to read some text files and convert them into a string, but I found the result is wrongly encoded and not readable at all. Here's my environment: Windows 2003, OS encoding: CP1252 Java 5.0 My files are UTF-8…
nybon
  • 8,894
  • 9
  • 59
  • 67
134
votes
3 answers

Unicode equivalents for \w and \b in Java regular expressions?

Many modern regex implementations interpret the \w character class shorthand as "any letter, digit, or connecting punctuation" (usually: underscore). That way, a regex like \w+ matches words like hello, élève, GOÄ_432 or gefräßig. Unfortunately,…
Tim Pietzcker
  • 328,213
  • 58
  • 503
  • 561
132
votes
6 answers

How can I output UTF-8 from Perl?

I am trying to write a Perl script using the utf8 pragma, and I'm getting unexpected results. I'm using Mac OS X 10.5 (Leopard), and I'm editing with TextMate. All of my settings for both my editor and operating system are defaulted to writing…
Peter Conrey