Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to cover every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
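The mapping between characters and code points can be inspected from most languages. Here is a minimal Python sketch using only the built-in ord() and chr(); the helper name code_point_label is hypothetical and exists only for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


# The characters from the list above; the Greek letters are written as escapes.
for ch in "ABC\u039b\u039c":
    print(code_point_label(ch), ch)

# chr() is the inverse of ord(): it turns a code point back into a character.
lambda_char = chr(0x039B)
```

The label is only a naming convention for the integer code point; it says nothing yet about how the character is stored as bytes.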

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how to encode code points as byte sequences. The most common are UTF-8, which encodes each code point as one to four bytes, and UTF-16, which encodes each code point as two or four bytes.

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
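The rows of the table can be reproduced with Python's str.encode(). This is a small sketch, not a reference implementation; hex_bytes is a hypothetical helper, and "utf-16-be" is Python's codec name for big-endian UTF-16 without a BOM.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


# Same code points as in the table, plus one outside the Basic Multilingual
# Plane (U+1F600) to show the four-byte forms of both encodings.
for ch in ["A", "\u039b", "\U0001F600"]:
    label = f"U+{ord(ch):04X}"
    print(label, hex_bytes(ch, "utf-8"), "|", hex_bytes(ch, "utf-16-be"))
```

Code points above U+FFFF take four bytes in both encodings; in UTF-16 they are stored as a surrogate pair.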

Specification

Beyond the character set itself, the Unicode Consortium defines standards for collation (sorting), capitalization rules, character normalization and other locale-sensitive character operations.
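As a rough illustration of normalization and case handling, the sketch below uses Python's standard unicodedata module and str.casefold(). The helper loosely_equal is hypothetical, and the comparison is a simplification: it is locale-independent and does not implement the full caseless-matching or collation rules from the standard.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings render the same but differ as code point sequences.
same_before = composed == decomposed

# NFC composes, NFD decomposes; after normalization the strings match.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)


def loosely_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case and composition differences."""
    def key(s: str) -> str:
        # Case-fold first, then normalize so composed/decomposed forms agree.
        return unicodedata.normalize("NFC", s.casefold())
    return key(a) == key(b)


print(same_before, nfc == composed, nfd == decomposed)
print(loosely_equal("Stra\u00dfe", "STRASSE"))
```

Locale-sensitive behaviour, such as language-specific sort order, needs a dedicated library and is outside what this sketch covers.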

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions
12
votes
3 answers

Will everything in the standard library treat strings as unicode in Python 3.0?

I'm a little confused about how the standard library will behave now that Python (from 3.0) is unicode-based. Will modules such as CGI and urllib use unicode strings or will they use the new 'bytes' type and just provide encoded data?
hacama
  • 339
  • 2
  • 5
12
votes
5 answers

Remove multiple BOMs from a file

I am using a JavaScript file that is a concatenation of other JavaScript files. Unfortunately, the person who concatenated these JavaScript files together did not use the proper encoding when reading the file, and allowed a BOM for every single…
Macy Abbey
  • 3,877
  • 1
  • 20
  • 30
12
votes
1 answer

Unicode in button title in XCode

I'm trying to display the Greek letter pi (Unicode \u03C0) on a button as the title. When I try to set the title using the drag-and-drop graphical editor, the literal text "\u03C0" shows. Is there some way to set Unicode text in the graphical editor, or do I need…
user1157134
  • 121
  • 1
  • 1
  • 3
12
votes
2 answers

Is it safe to assume users can see unicode characters U+2716 and U+2714 in CSS content?

I want to use the characters ✖ (U+2716) and ✔ (U+2714) in my CSS for form validation purposes. Basically, if a field is valid/invalid, I use the :after pseudo-element to insert the corresponding symbol after the field. For example: .field:after { …
Philip Walton
  • 29,693
  • 16
  • 60
  • 84
12
votes
4 answers

rules for slugs and unicode

After researching the different ways people slugify titles, I've noticed that handling of non-English titles is often missing. URL encoding is very restrictive. See http://www.blooberry.com/indexdot/html/topics/urlencoding.htm So,…
bustrofedon
  • 281
  • 4
  • 15
12
votes
1 answer

I lose “unicodeness” when qDebug()ing after instancing a QApplication

I lose the ability to print Unicode characters right after instantiating a QApplication object. Given the following code, with all the needed libraries included: int main(int argc, char** argv) { qDebug() << "aeiou áéíóú"; …
user1598585
12
votes
1 answer

python subprocess and unicode execv() arg 2 must contain only strings

I have a Django site where I need to call a script using subprocess. The subprocess call works when I'm using ASCII characters, but when I try to pass arguments that are UTF-8 encoded, I get an error: execv() arg 2 must contain only strings. The…
deecodameeko
  • 505
  • 7
  • 18
12
votes
4 answers

Python efficient obfuscation of string

I need to obfuscate lines of Unicode text to slow down those who may want to extract them. Ideally this would be done with a built-in Python module or a small add-on library; the string length will be the same or less than the original; and the…
Tim
  • 187
  • 1
  • 1
  • 10
12
votes
2 answers

Detecting IME input before enter pressed in Javascript

I'm not even sure if this is possible, so apologies if it's a stupid question. I've set up a keyup callback through jQuery to run a function when a user types in an input box. It works fine for English. However when inputting text in…
benui
  • 6,440
  • 5
  • 34
  • 49
12
votes
5 answers

Can a PHP file name (or a dir in its full path) have UTF-8 characters?

I would like to access a PHP file whose name has UTF-8 characters in it. The file does not have a BOM in it. It just contains an echo statement that displays a few Unicode characters. Accessing the PHP page from the browser (Firefox 3.0.8, IE7)…
Raleigh
  • 408
  • 1
  • 3
  • 9
12
votes
4 answers

Getting python to print in UTF8 on Windows XP with the console

I would like to configure my console on Windows XP to support UTF-8 and to have Python detect that and work with it. So far, my attempts: C:\Documents and Settings\Philippe>C:\Python25\python.exe Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC…
Philippe F
  • 11,776
  • 5
  • 29
  • 30
12
votes
1 answer

StreamReader is unable to correctly read extended character set (UTF8)

I am having an issue where I am unable to read a file that contains foreign characters. The file, I have been told, is encoded in UTF-8 format. Here is the core of my code: using (FileStream fileStream = fileInfo.OpenRead()) { using…
PolandSpring
  • 2,664
  • 7
  • 26
  • 35
12
votes
3 answers

Delphi2010: Writing code to assign Caption containing Unicode literal values or load unicode symbols from text file?

How to make a Unicode program in Delphi 2010? I have English Windows and "Current language for non-Unicode programs" is English too. Static controls look good but if I try to change them (Label.Caption := 'unicode value' or…
Michael
  • 475
  • 2
  • 9
  • 17
12
votes
6 answers

How to print tuples of unicode strings in original language (not u'foo' form)

I have a list of tuples of unicode objects: >>> t = [('亀',), ('犬',)] Printing this out, I get: >>> print t [('\xe4\xba\x80',), ('\xe7\x8a\xac',)] which I guess is a list of the UTF-8 byte representations of those strings? But what I want to…
Daniel H
  • 9,895
  • 3
  • 19
  • 11
12
votes
3 answers

Parse a non-ascii (unicode) number-string as integer in .NET

I have a string containing a number in a non-ASCII format, e.g. Unicode BENGALI DIGIT ONE (U+09E7): "১". How do I parse this as an integer in .NET? Note: I've tried using int.Parse() specifying a Bengali culture format with "bn-BD" as the…
James McCormack
  • 9,217
  • 3
  • 47
  • 57