Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
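A minimal sketch of the character-to-code-point mapping, using Python's built-in `ord` and `chr`. The helper name `code_point_label` is hypothetical and exists only for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; format it as
    # at least four uppercase hex digits.
    return f"U+{ord(ch):04X}"


# The characters listed above; the Greek letters are written as escapes
# so they cannot be confused with look-alike Latin letters.
for ch in "ABC\u039B\u039C":
    print(code_point_label(ch), ch)

# chr() goes the other way, from code point to character.
lambda_char = chr(0x039B)
```

Note that `ord` accepts exactly one code point, so a user-perceived character made of several code points (a base letter plus a combining accent, for example) has to be labelled one code point at a time.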

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common are UTF-8 (one to four bytes per code point) and UTF-16 (two or four bytes per code point).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
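The rows in the table can be reproduced with Python's `str.encode`. This is a sketch rather than a reference implementation; `hex_bytes` is a hypothetical helper, and `"utf-16-be"` is the Python codec name for big-endian UTF-16 without a byte order mark.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# Print a table like the one above, plus one code point outside the
# Basic Multilingual Plane (U+1F600) to show the four-byte forms.
for ch in ["A", "\u039B", "\U0001F600"]:
    label = f"U+{ord(ch):04X}"
    print(f"{label:<12}{hex_bytes(ch, 'utf-8'):<16}{hex_bytes(ch, 'utf-16-be')}")
```

For U+1F600 both encodings need four bytes: UTF-8 uses a four-byte sequence and UTF-16 uses a surrogate pair.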

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
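As a rough illustration of normalization and case rules, the sketch below uses Python's standard `unicodedata.normalize` and `str.casefold`. The helpers `nfc` and `same_text` are hypothetical names, and the comparison is deliberately simplified; it does not cover locale-specific rules or collation.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize text to Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


def same_text(a: str, b: str) -> bool:
    """Simplified caseless comparison: normalize, then case-fold both sides."""
    return nfc(a).casefold() == nfc(b).casefold()


decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

print(decomposed == composed)            # different code point sequences
print(nfc(decomposed) == composed)       # equal after normalization
print(same_text("Stra\u00dfe", "STRASSE"))  # case folding maps sharp s to "ss"
```

Normalizing before comparing matters because two strings can render identically while containing different code point sequences.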

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24,916 questions
406
votes
6 answers

How to find the length of a string in R

How to find the length of a string (i.e., number of characters in a string) without splitting it in R? I know how to find the length of a list but not of a string. And what about Unicode strings? How do I find the length (in bytes) and the number of…
Igor Chubin
  • 61,765
  • 13
  • 122
  • 144
404
votes
12 answers

Print Directory & File Structure with icons for representation in Markdown

I want a Linux command to print directory & file structures in the form of a tree, possibly with Unicode icons before each file, and some hint for the best syntax to include the output in a Markdown document, without spaces between…
Matt Rowles
  • 7,721
  • 18
  • 55
  • 88
394
votes
2 answers

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings? In what way are these helpful for programmers?
web dunia
  • 9,381
  • 18
  • 52
  • 64
390
votes
1 answer

Placing Unicode character in CSS content value

I have a problem. I have found the HTML code for the downwards arrow, &darr; (↓) Cool. Now I need to use it in CSS like so: nav a:hover {content:"&darr";} That obviously won't work since &darr; is an HTML symbol. There seems to be less info about…
davecave
  • 4,698
  • 6
  • 26
  • 32
364
votes
6 answers

Why does 2+ 40 equal 42?

I was baffled when a colleague showed me this line of JavaScript alerting 42. alert(2+ 40); It quickly turns out that what looks like a minus sign is actually an arcane Unicode character with clearly different semantics. This left me wondering…
GOTO 0
  • 42,323
  • 22
  • 125
  • 158
356
votes
16 answers

How to remove \xa0 from string in Python?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into…
zhuyxn
  • 6,671
  • 9
  • 38
  • 44
355
votes
19 answers

How can I use Unicode characters on the Windows command line?

We have a project in Team Foundation Server (TFS) that has a non-English character (š) in it. When trying to script a few build-related things, we've stumbled upon a problem; we can't pass the š letter to the command-line tools. The command prompt…
Vilx-
  • 104,512
  • 87
  • 279
  • 422
322
votes
12 answers

Replace non-ASCII characters with a single space

I need to replace all non-ASCII (\x00-\x7F) characters with a space. I'm surprised that this is not dead-easy in Python, unless I'm missing something. The following function simply removes all non-ASCII characters: def remove_non_ascii_1(text): …
dotancohen
  • 30,064
  • 36
  • 138
  • 197
319
votes
7 answers

"SyntaxError: Non-ASCII character ..." or "SyntaxError: Non-UTF-8 code starting with ..." trying to use non-ASCII text in a Python script

I tried this code in Python 2: def NewFunction(): return '£' But I get an error message that says: SyntaxError: Non-ASCII character '\xa3' in file '...' but no encoding declared; see http://www.python.org/peps/pep-0263.html for…
SNIFFER_dog
  • 3,243
  • 2
  • 14
  • 4
310
votes
12 answers

How do I check if a string is unicode or ascii?

What do I have to do in Python to figure out which encoding a string has?
TIMEX
  • 259,804
  • 351
  • 777
  • 1,080
305
votes
6 answers

u'\ufeff' in Python string

I got an error with the following exception message: UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128) Not sure what u'\ufeff' is, it shows up when I'm web scraping. How can I remedy the…
James Hallen
  • 4,534
  • 4
  • 23
  • 28
301
votes
21 answers

How to get string objects instead of Unicode from JSON

I'm using Python 2 to parse JSON from ASCII encoded text files. When loading these files with either json or simplejson, all my string values are cast to Unicode objects instead of string objects. The problem is, I have to use the data with some…
Brutus
  • 7,139
  • 7
  • 36
  • 41
292
votes
12 answers

How many bytes does one Unicode character take?

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require? I assume that one Unicode character can contain every possible character from any language - am…
nan
  • 19,595
  • 7
  • 48
  • 80
283
votes
10 answers

Concrete JavaScript regular expression for accented characters (diacritics)

I've looked on Stack Overflow (replacing characters.. eh, how JavaScript doesn't follow the Unicode standard concerning RegExp, etc.) and haven't really found a concrete answer to the question "How can JavaScript match accented characters (those…
Chris Cirefice
  • 5,475
  • 7
  • 45
  • 75
272
votes
2 answers

What's the difference between a character, a code point, a glyph and a grapheme?

Trying to understand the subtleties of modern Unicode is making my head hurt. In particular, the distinction between code points, characters, glyphs and graphemes - concepts which in the simplest case, when dealing with English text using ASCII…
Mark Amery
  • 143,130
  • 81
  • 406
  • 459