Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character required for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

U+0041 A
U+0042 B
U+0043 C
...
U+039B Λ
U+039C Μ
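The mapping above can be reproduced in most languages. Here is a minimal Python 3 sketch using the built-ins `ord()` and `chr()`; the helper name `code_point_label` is made up for illustration and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


# chr() turns a code point into a character; ord() goes the other way.
for cp in (0x0041, 0x0042, 0x0043, 0x039B, 0x039C):
    ch = chr(cp)
    print(code_point_label(ch), ch)
```

The `04X` format pads to at least four hex digits, which matches the usual U+XXXX notation; code points above U+FFFF simply print with five or six digits.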

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
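The byte values in the table can be checked with a short Python 3 sketch. It relies only on `str.encode()` with the standard codec names `utf-8` and `utf-16-be`; the helper `hex_bytes` is a hypothetical name used for illustration. U+1F469 is added here to show the four-byte case, which the table above does not cover.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


for cp in (0x0041, 0x039B, 0x1F469):
    ch = chr(cp)
    utf8 = hex_bytes(ch, "utf-8")
    utf16 = hex_bytes(ch, "utf-16-be")  # big-endian, no byte order mark
    print(f"U+{cp:04X}  UTF-8: {utf8}  UTF-16BE: {utf16}")
```

Using `utf-16-be` rather than plain `utf-16` avoids the byte order mark that Python would otherwise prepend, so the output lines up with the big-endian column of the table.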

UTF FAQ, UTF-16 FAQ, UTF-8 FAQ

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
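As a rough illustration of two of these operations, the Python 3 sketch below uses the standard library's `unicodedata.normalize()` and the built-in `str.upper()` / `str.casefold()`. The sample strings are chosen for illustration only, and locale-aware sorting is left out because it depends on the locales installed on the system.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look identical when rendered but compare unequal as-is.
print(composed == decomposed)  # False

# Normalizing to NFC composes the pair into the single code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)  # True

# Case mapping is not always one-to-one: sharp s uppercases to "SS".
word = "stra\u00dfe"     # "straße"
print(word.upper())      # STRASSE
print(word.casefold() == "strasse")  # True
```

Normalizing both sides before comparing (or before hashing and storing) is one common way to avoid treating visually identical strings as different.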

Latest Version of the Standard

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions

641 votes, 14 answers

UTF-8, UTF-16, and UTF-32

What are the differences between UTF-8, UTF-16, and UTF-32? I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?

unicode utf-8 utf-16 utf utf-32

asked Jan 30 '09 at 17:05

user60456

605 votes, 21 answers

Best way to convert text files between character sets?

What is the fastest, easiest tool or method to convert text files between character sets? Specifically, I need to convert from UTF-8 to ISO-8859-15 and vice versa. Everything goes: one-liners in your favorite scripting language, command-line tools…

text unicode utf-8 character-set

asked Sep 15 '08 at 17:21

Antti Kissaniemi

18,944
13
54
47

597 votes, 15 answers

Twitter image encoding challenge

If a picture's worth 1000 words, how much of a picture can you fit in 140 characters? Note: That's it folks! Bounty deadline is here, and after some tough deliberation, I have decided that Boojum's entry just barely edged out Sam Hocevar's. I will…

twitter unicode compression

asked May 21 '09 at 06:37

Brian Campbell

322,767
57
360
340

594 votes, 7 answers

Why does modern Perl avoid UTF-8 by default?

I wonder why most modern solutions built using Perl don't enable UTF-8 by default. I understand there are many legacy problems for core Perl scripts, where it may break things. But, from my point of view, in the 21st century, big new projects (or…

perl unicode utf-8

asked May 28 '11 at 15:12

w.k

8,218
4
32
55

584 votes, 6 answers

Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?

The character 👩‍👩‍👧‍👦 (family with two women, one girl, and one boy) is encoded as such: U+1F469 WOMAN, U+200D ZWJ, U+1F469 WOMAN, U+200D ZWJ, U+1F467 GIRL, U+200D ZWJ, U+1F466 BOY So it's very interestingly-encoded; the perfect target for a unit test.…

swift string unicode emoji

asked Apr 25 '17 at 18:36

Ky -

30,724
51
192
308

567 votes, 53 answers

Best way to reverse a string

I've just had to write a string reverse function in C# 2.0 (i.e. LINQ not available) and came up with this: public string Reverse(string text) { char[] cArray = text.ToCharArray(); string reverse = String.Empty; for (int i =…

c# .net performance algorithm unicode

asked Oct 23 '08 at 00:31

Guy

65,082
97
254
325

551 votes, 9 answers

What's the difference between ASCII and Unicode?

What's the exact difference between Unicode and ASCII? ASCII has a total of 128 characters (256 in the extended set). Is there any size specification for Unicode characters?

unicode ascii

asked Oct 06 '13 at 18:25

Ashvitha

5,836
6
18
18

542 votes, 12 answers

Convert a Unicode string to a string in Python (containing extra symbols)

How do you convert a Unicode string (containing extra characters like £ $, etc.) into a Python string?

string unicode type-conversion python-2.x

asked Jul 30 '09 at 15:41

William Troup

12,739
21
70
98

494 votes, 10 answers

Error "(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape"

I'm trying to read a CSV file into Python (Spyder), but I keep getting an error. My code: import csv data = open("C:\Users\miche\Documents\school\jaar2\MIK\2.6\vektis_agb_zorgverlener") data = csv.reader(data) print(data) I get the following…

python csv unicode syntax-error

asked May 23 '16 at 21:36

Miesje

4,937
3
10
7

483 votes, 9 answers

What are Unicode, UTF-8, and UTF-16?

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well, but it's not clear to me. In VSS, when doing a file comparison, sometimes there is a message saying the two files have…

unicode encoding utf-8 utf-16

asked Feb 11 '10 at 00:12

SoftwareGeek

15,234
19
61
78

465 votes, 13 answers

UnicodeDecodeError, invalid continuation byte

Why is the below item failing? Why does it succeed with "latin-1" codec? o = "a test of \xe9 char" #I want this to remain a string as this is what I am receiving v = o.decode("utf-8") Which results in: Traceback (most recent call last): File…

python unicode decode

asked Apr 05 '11 at 13:23

RuiDC

8,403
7
26
21

457 votes, 10 answers

How to correct TypeError: Unicode-objects must be encoded before hashing?

I have this error: Traceback (most recent call last): File "python_md5_cracker.py", line 27, in m.update(line) TypeError: Unicode-objects must be encoded before hashing when I try to execute this code in Python 3.2.2: import hashlib,…

python python-3.x unicode syntax-error hashlib

asked Sep 28 '11 at 15:04

JohnnyFromBF

9,873
10
45
59

425 votes, 16 answers

How do I grep for all non-ASCII characters?

I have several very large XML files and I'm trying to find the lines that contain non-ASCII characters. I've tried the following: grep -e "[\x{00FF}-\x{FFFF}]" file.xml But this returns every line in the file, regardless of whether the line…

regex unix unicode grep

asked Jun 08 '10 at 20:48

pconrey

5,805
7
29
38

421 votes, 10 answers

"Unicode Error "unicodeescape" codec can't decode bytes... Cannot open text files in Python 3

I am using Python 3.1 on a Windows 7 machine. Russian is the default system language, and utf-8 is the default encoding. Looking at the answer to a previous question, I have attempted to use the "codecs" module to give me a little luck. Here's a few…

python unicode python-3.x

asked Aug 28 '09 at 15:36

Eric

4,283
3
18
7

410 votes, 14 answers

Unicode (UTF-8) reading and writing to files in Python

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4). # The string, which has an a-acute in it. ss = u'Capit\xe1n' ss8 = ss.encode('utf8') repr(ss), repr(ss8) ("u'Capit\xe1n'", "'Capit\xc3\xa1n'") print…

python unicode utf-8 io

asked Jan 29 '09 at 15:01

Gregg Lind

20,690
15
67
81
