Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
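As a rough illustration of the mapping above, the following Python 3 sketch converts between characters and code points with the built-in ord() and chr(). The helper name code_point_label is made up for this example and is not part of any library.

```python
# Characters from the list above, written with escapes to avoid ambiguity.
chars = ["A", "B", "C", "\u039b", "\u039c"]


def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character.

    Hypothetical helper for illustration only.
    """
    return "U+{:04X}".format(ord(ch))


labels = [code_point_label(ch) for ch in chars]
for ch, label in zip(chars, labels):
    print(label, ch)

# chr() goes the other way: from a code point number back to a character.
lamda = chr(0x039B)
```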

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common forms are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
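The rows of the table can be reproduced with a short Python 3 sketch using str.encode() and the standard "utf-8" and "utf-16-be" codec names. The helper hex_bytes is an assumption made for this example. The last code point, U+10302, is not in the table; it is included only to show the four-byte case.

```python
def hex_bytes(data: bytes) -> str:
    """Format a byte string as space-separated uppercase hex pairs.

    Hypothetical helper for illustration only.
    """
    return " ".join("{:02X}".format(b) for b in data)


# The last entry lies outside the Basic Multilingual Plane, so it needs
# four bytes in UTF-8 and a surrogate pair (four bytes) in UTF-16.
code_points = (0x0041, 0x0042, 0x0043, 0x039B, 0x039C, 0x10302)

rows = []
for cp in code_points:
    ch = chr(cp)
    utf8 = hex_bytes(ch.encode("utf-8"))
    utf16 = hex_bytes(ch.encode("utf-16-be"))
    rows.append((cp, utf8, utf16))
    print("U+{:04X}  {:<12} {}".format(cp, utf8, utf16))
```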

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
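Two of these operations can be sketched with the Python 3 standard library: unicodedata.normalize() for normalization, and str.upper() and str.casefold() for case mapping. This is a minimal illustration rather than a complete treatment; locale-sensitive sorting (collation) is not shown.

```python
import unicodedata

# The same visible character, e with an acute accent, spelled two ways.
composed = "\u00e9"      # one precomposed code point
decomposed = "e\u0301"   # letter e followed by COMBINING ACUTE ACCENT

# The two strings differ code point by code point until they are normalized.
equal_before = composed == decomposed
equal_nfc = unicodedata.normalize("NFC", decomposed) == composed
equal_nfd = unicodedata.normalize("NFD", composed) == decomposed

# Case mapping is not always one-to-one: sharp s uppercases to two letters.
sharp_s_upper = "\u00df".upper()

# casefold() is intended for caseless comparison.
folded = "Stra\u00dfe".casefold()

print(equal_before, equal_nfc, equal_nfd, sharp_s_upper, folded)
```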

For more general information, see the Unicode article on Wikipedia.

24916 questions
14
votes
2 answers

Monospaced font/symbols for JTextPane

I want to build a console-like output using JTextPane. Therefore I am using a monospaced font: textpane.setFont(new Font(Font.MONOSPACED, Font.PLAIN, 12)); This works fine for all kinds of alphanumeric characters (a-z, 0-9, etc.), but when it comes…
user28061
  • 354
  • 1
  • 4
  • 14
14
votes
6 answers

Unicode string with diacritics split by chars

I have this Unicode string: Ааа́Ббб́Ввв́Г㥴Дд And I want to split it by chars. Right now, if I try to loop through all chars, I get something like this: A a a ' Б ... Is there a way to properly split this string into chars: А а а́ ?
Gapipro
  • 1,913
  • 2
  • 22
  • 34
14
votes
1 answer

Strange `UnicodeEncodeError` using `os.path.exists`

In a web-application (using Flask), I get the following error: Unable to retrieve the thumbnail for u'/var/data/uploads/2012/03/22/12 Gerd\xb4s Banjo Trio 1024.jpg' Traceback (most recent call last): File…
exhuma
  • 20,071
  • 12
  • 90
  • 123
14
votes
2 answers

Python: Creating a Unicode string

I have a problem in Python with Unicode. I need to plot a graph with Unicode annotations in it. According to the tutorial I should just create my string in Unicode. I do it like this: annotation = u"%s has %s rev"%(art.title, len(art.revisions)) It is…
ashim
  • 24,380
  • 29
  • 72
  • 96
14
votes
1 answer

How can I open UTF-16 files on Python 2.x?

I'm working on a Python tool that must be able to open files of UTF-8 and UTF-16 encoding. In Python 3.2, I use the following code to try opening the file using UTF-8, then try it with UTF-16 if there's a unicode error: def readGridFromPath(self,…
stalepretzel
  • 15,543
  • 22
  • 76
  • 91
13
votes
2 answers

What does the expression \X match when inside a RegEx?

According to http://www.regular-expressions.info, You can consider \X the Unicode version of the dot in regex engines that use plain ASCII. Does this mean that it will match any possible Unicode code point?
federico-t
  • 12,014
  • 19
  • 67
  • 111
13
votes
2 answers

In Python, how do I convert a list of ints and strings to Unicode?

x = ['Some strings.', 1, 2, 3, 'More strings!', 'Fanc\xc3\xbf string!'] y = [i.decode('UTF-8') for i in x] What's the best way to convert the strings in x to Unicode? Doing a list comprehension causes an attribute error (AttributeError: 'int' object…
Buttons840
  • 9,239
  • 15
  • 58
  • 85
13
votes
4 answers

Why does the 'degree' symbol differ from UTF-8 from Unicode?

Why does the degree symbol differ between UTF-8 and Unicode? According to http://www.utf8-chartable.de/ and http://www.fileformat.info/info/unicode/char/b0/index.htm, Unicode is B0, but UTF-8 is C2 B0. How come?
Muhammad Hewedy
  • 29,102
  • 44
  • 127
  • 219
13
votes
3 answers

UnicodeDecodeError on join

I have a list with some strings (most of which I fetched from a sqlite3 database): stats_list = ['Statistik \xc3\xb6ver s\xc3\xa5nger\n', 'Antal\tS\xc3\xa5ng', '1\tCarola - Betlehems Stj\xc3\xa4rna', '\n\nStatistik \xc3\xb6ver datak\xc3\xa4llor\n',…
Niclas Nilsson
  • 5,691
  • 3
  • 30
  • 43
13
votes
1 answer

Unicode string literals

C++11 introduces a new set of string literal prefixes (and even allows user-defined suffixes). On top of this, you can directly use Unicode escape sequences to code a certain symbol without having to worry about encoding. const char16_t* s16 =…
rubenvb
  • 74,642
  • 33
  • 187
  • 332
13
votes
3 answers

How to iterate over Unicode characters in Python 3?

I need to step through a Python string one character at a time, but a simple "for" loop gives me UTF-16 code units instead: str = "abc\u20ac\U00010302\U0010fffd" for ch in str: code = ord(ch) print("U+{:04X}".format(code)) That…
Ross Smith
  • 3,719
  • 1
  • 25
  • 22
13
votes
2 answers

Unicode characters having asymmetric upper/lower case. Why?

Why do the following three characters not have symmetric toLower, toUpper results? /** * Written in the Scala programming language, typed into the Scala REPL. * Results commented accordingly. */ /* Unicode Character 'LATIN CAPITAL LETTER SHARP…
Tim Friske
  • 2,012
  • 1
  • 18
  • 28
13
votes
3 answers

copyright character in vim

I used to get this copyright symbol in vim through some key combination. Can someone help me with it now? I simply fail to recollect it. Also, if possible, share some more such characters... someone might need them sometime.
Shree
  • 4,627
  • 6
  • 37
  • 49
13
votes
5 answers

Can the French and Spanish special chars be held in a varchar?

French and Spanish have special chars in them that are not used in normal English (accented vowels and such). Are those chars supported in a varchar? Or do I need a nvarchar for them? (NOTE: I do NOT want a discussion on if I should use nvarchar or…
Vaccano
  • 78,325
  • 149
  • 468
  • 850
13
votes
5 answers

How to print a unicode string in python in Windows console

I'm working on a Python application that can print text in multiple languages to the console on multiple platforms. The program works well on all UNIX platforms, but on Windows there are errors printing Unicode strings on the command line. There's…
yonix
  • 11,665
  • 7
  • 34
  • 52