Questions tagged [unicode]

Unicode is a standard for encoding, representing, and handling text. It aims to cover every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
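
As a small illustration of the mapping above, Python's built-in `ord` and `chr` convert between a character and its code point. This is only a sketch; the helper name `code_point_label` is made up for the example and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# The code points listed above, given as integers.
for cp in (0x0041, 0x0042, 0x0043, 0x039B, 0x039C):
    ch = chr(cp)  # code point -> character
    print(code_point_label(ch), ch)
```

`ord` goes from character to code point and `chr` goes back, so `chr(ord(ch)) == ch` holds for any single character.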

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how to encode code points as sequences of bytes. The most common are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
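
The rows of the table can be reproduced with Python's `str.encode`. This is a minimal sketch: `hex_bytes` is a hypothetical helper, and U+1F469 is added here only to show a code point that needs four bytes in both encodings.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

for cp in (0x0041, 0x039B, 0x1F469):
    ch = chr(cp)
    utf8 = hex_bytes(ch.encode("utf-8"))
    utf16 = hex_bytes(ch.encode("utf-16-be"))  # big-endian, no byte order mark
    print(f"U+{cp:04X}  UTF-8: {utf8}  UTF-16BE: {utf16}")

# U+0041   UTF-8: 41           UTF-16BE: 00 41
# U+039B   UTF-8: CE 9B        UTF-16BE: 03 9B
# U+1F469  UTF-8: F0 9F 91 A9  UTF-16BE: D8 3D DC 69
```

The last row shows a UTF-16 surrogate pair: code points above U+FFFF do not fit in one 16-bit unit, so UTF-16 uses two.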

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
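
To make normalization and case rules concrete, here is a hedged sketch using Python's standard `unicodedata` module and `str.casefold`. The function name `same_text` is an assumption for illustration, and the comparison is a simplification: it covers only the locale-independent rules, not locale-specific collation or casing.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization and case folding.

    Simplified: ignores locale-specific rules and full canonical
    caseless matching as defined by the Unicode Standard.
    """
    def fold(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return fold(a) == fold(b)

composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by a combining acute accent

print(composed == decomposed)           # False: different code point sequences
print(same_text(composed, decomposed))  # True: equal after normalization
print(same_text("Straße", "STRASSE"))   # True: case folding maps ß to ss
```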

For more general information, see the Unicode article on Wikipedia.

24916 questions
641 votes, 14 answers

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32? I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?
user60456
605 votes, 21 answers

Best way to convert text files between character sets?

What is the fastest, easiest tool or method to convert text files between character sets? Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa. Everything goes: one-liners in your favorite scripting language, command-line tools…
Antti Kissaniemi
597 votes, 15 answers

Twitter image encoding challenge

If a picture's worth 1000 words, how much of a picture can you fit in 140 characters? Note: That's it folks! Bounty deadline is here, and after some tough deliberation, I have decided that Boojum's entry just barely edged out Sam Hocevar's. I will…
Brian Campbell
594 votes, 7 answers

Why does modern Perl avoid UTF-8 by default?

I wonder why most modern solutions built using Perl don't enable UTF-8 by default. I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or…
w.k
584 votes, 6 answers

Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?

The character 👩‍👩‍👧‍👦 (family with two women, one girl, and one boy) is encoded as follows: U+1F469 WOMAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL, U+200D ZWJ, U+1F466 BOY. So it's very interestingly encoded; the perfect target for a unit test.…
Ky -
567 votes, 53 answers

Best way to reverse a string

I've just had to write a string reverse function in C# 2.0 (i.e. LINQ not available) and came up with this: public string Reverse(string text) { char[] cArray = text.ToCharArray(); string reverse = String.Empty; for (int i =…
Guy
551 votes, 9 answers

What's the difference between ASCII and Unicode?

What's the exact difference between Unicode and ASCII? ASCII has a total of 128 characters (256 in the extended set). Is there any size specification for Unicode characters?
Ashvitha
542 votes, 12 answers

Convert a Unicode string to a string in Python (containing extra symbols)

How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?
William Troup
494 votes, 10 answers

Error "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape"

I'm trying to read a CSV file into Python (Spyder), but I keep getting an error. My code: import csv data = open("C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener") data = csv.reader(data) print(data) I get the following…
Miesje
483 votes, 9 answers

What are Unicode, UTF-8, and UTF-16?

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me. In VSS, when doing a file comparison, sometimes there is a message saying the two files have…
SoftwareGeek
465 votes, 13 answers

UnicodeDecodeError, invalid continuation byte

Why is the below item failing? Why does it succeed with "latin-1" codec? o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving v = o.decode("utf-8") Which results in: Traceback (most recent call last): File…
RuiDC
457 votes, 10 answers

How to correct TypeError: Unicode-objects must be encoded before hashing?

I have this error: Traceback (most recent call last): File "python_md5_cracker.py", line 27, in <module> m.update(line) TypeError: Unicode-objects must be encoded before hashing when I try to execute this code in Python 3.2.2: import hashlib,…
JohnnyFromBF
425 votes, 16 answers

How do I grep for all non-ASCII characters?

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following: grep -e "[\x{00FF}-\x{FFFF}]" file.xml But this returns every line in the file, regardless of whether the line…
pconrey
421 votes, 10 answers

"Unicode Error "unicodeescape" codec can't decode bytes... Cannot open text files in Python 3

I am using Python 3.1 on a Windows 7 machine. Russian is the default system language, and utf-8 is the default encoding. Looking at the answer to a previous question, I have attempted to use the "codecs" module to give me a little luck. Here's a few…
Eric
410 votes, 14 answers

Unicode (UTF-8) reading and writing to files in Python

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4). # The string, which has an a-acute in it. ss = u'Capit\xe1n' ss8 = ss.encode('utf8') repr(ss), repr(ss8) ("u'Capit\xe1n'", "'Capit\xc3\xa1n'") print…
Gregg Lind