Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support the characters of all writing systems, along with technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

U+0041 A
U+0042 B
U+0043 C
...
U+039B Λ
U+039C Μ
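As a minimal sketch of the mapping above, Python's built-in `ord` returns the code point of a single character. The helper name `code_point` is an assumption made for this illustration and is not part of any library.

```python
def code_point(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# The Greek letters are written as escapes so they cannot be confused
# with the look-alike Latin letters.
for ch in "ABC\u039b\u039c":
    print(code_point(ch), ch)
```

`chr` goes the other way: `chr(0x039B)` gives back the character for that code point.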

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common forms are UTF-8 (which encodes each code point as one to four bytes) and UTF-16 (which encodes each code point as two or four bytes, that is, one or two 16-bit code units).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
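The rows of the table can be reproduced with `str.encode`. This is a sketch only; `hex_bytes` is a hypothetical helper introduced here, and `utf-16-be` is Python's codec name for big-endian UTF-16 without a byte order mark.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes as space-separated hex."""
    return " ".join(f"{b:02X}" for b in ch.encode(encoding))

# The last character (U+1F600) lies outside the Basic Multilingual Plane,
# so it needs four bytes in both encodings.
for ch in "ABC\u039b\u039c\U0001F600":
    print(f"U+{ord(ch):04X}", hex_bytes(ch, "utf-8"), hex_bytes(ch, "utf-16-be"), sep="\t")
```

The supplementary character shows why "number of bytes" and "number of characters" are different questions: a single code point can take anywhere from one to four bytes depending on the encoding.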

Further reading: UTF FAQ, UTF-16 FAQ, UTF-8 FAQ

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
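To illustrate two of these operations, here is a small sketch using Python's standard `unicodedata` module and `str.casefold`. The function names `canonically_equal` and `caseless_equal` are assumptions made for this example. Python's built-in case methods are not locale-sensitive, and collation (sorting) is not covered here.

```python
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case, using Unicode case folding."""
    return a.casefold() == b.casefold()

composed = "\u00e9"      # e with acute accent as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                     # raw code point sequences differ
print(canonically_equal(composed, decomposed))    # equal once normalized
print(caseless_equal("Stra\u00dfe", "STRASSE"))   # sharp s folds to "ss"
```

A stricter caseless comparison would also normalize both strings before and after folding; that step is left out to keep the sketch short.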

Latest Version of the Standard

Identifying Characters

For more general information, see the Unicode article on Wikipedia.


24916 questions

406 votes, 6 answers

How to find the length of a string in R

How to find the length of a string (i.e., number of characters in a string) without splitting it in R? I know how to find the length of a list but not of a string. And what about Unicode strings? How do I find the length (in bytes) and the number of…

r string unicode string-length

asked Jun 21 '12 at 09:01

Igor Chubin


404 votes, 12 answers

Print Directory & File Structure with icons for representation in Markdown

I want a Linux command to print directory & file structures in the form of a tree, possibly with Unicode icons before each file, and some hint for the best syntax to include the output in a Markdown document, without spaces between…

unicode markdown directory-structure

asked Oct 31 '13 at 05:27

Matt Rowles


394 votes, 2 answers

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings? In what way are these helpful for programmers?

unicode character-encoding ascii ansi utf

asked Mar 31 '09 at 06:02

web dunia


390 votes, 1 answer

Placing Unicode character in CSS content value

I have a problem. I have found the HTML code for the downwards arrow, &darr; (↓). Cool. Now I need to use it in CSS like so: nav a:hover {content:"&darr";} That obviously won't work since &darr; is an HTML symbol. There seems to be less info about…

css unicode symbols unicode-escapes

asked May 01 '12 at 04:13

davecave


364 votes, 6 answers

Why does 2+ 40 equal 42?

I was baffled when a colleague showed me this line of JavaScript alerting 42. alert(2+ 40); It quickly turns out that what looks like a minus sign is actually an arcane Unicode character with clearly different semantics. This left me wondering…

javascript unicode

asked Jul 19 '15 at 23:48

GOTO 0


356 votes, 16 answers

How to remove \xa0 from string in Python?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into…

python python-2.7 unicode beautifulsoup utf-8

asked Jun 12 '12 at 09:12

zhuyxn


355 votes, 19 answers

How can I use Unicode characters on the Windows command line?

We have a project in Team Foundation Server (TFS) that has a non-English character (š) in it. When trying to script a few build-related things, we've stumbled upon a problem; we can't pass the š letter to the command-line tools. The command prompt…

unicode command-line input windows-console

asked Dec 23 '08 at 09:30

Vilx-


322 votes, 12 answers

Replace non-ASCII characters with a single space

I need to replace all non-ASCII (\x00-\x7F) characters with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters: def remove_non_ascii_1(text): …

python unicode encoding ascii

asked Nov 19 '13 at 18:09

dotancohen


319 votes, 7 answers

"SyntaxError: Non-ASCII character ..." or "SyntaxError: Non-UTF-8 code starting with ..." trying to use non-ASCII text in a Python script

I tried this code in Python 2: def NewFunction(): return '£' But I get an error message that says: SyntaxError: Non-ASCII character '\xa3' in file '...' but no encoding declared; see http://www.python.org/peps/pep-0263.html for…

python unicode python-unicode

asked May 14 '12 at 19:12

SNIFFER_dog


310 votes, 12 answers

How do I check if a string is unicode or ascii?

What do I have to do in Python to figure out which encoding a string has?

python unicode encoding utf-8

asked Feb 13 '11 at 22:27

TIMEX


305 votes, 6 answers

u'\ufeff' in Python string

I got an error with the following exception message: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128) Not sure what u'\ufeff' is, it shows up when I'm web scraping. How can I remedy the…

python unicode utf-8

asked Jul 28 '13 at 20:02

James Hallen


301 votes, 21 answers

How to get string objects instead of Unicode from JSON

I'm using Python 2 to parse JSON from ASCII encoded text files. When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some…

python json serialization unicode python-2.x

asked Jun 05 '09 at 16:32

Brutus


292 votes, 12 answers

How many bytes does one Unicode character take?

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require? I assume that one Unicode character can contain every possible character from any language - am…

string language-agnostic unicode encoding

asked Mar 13 '11 at 15:02

nan


283 votes, 10 answers

Concrete JavaScript regular expression for accented characters (diacritics)

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those…

javascript regex unicode

asked Dec 19 '13 at 19:54

Chris Cirefice


272 votes, 2 answers

What's the difference between a character, a code point, a glyph and a grapheme?

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII…

string unicode terminology

asked Dec 06 '14 at 12:44

Mark Amery

