Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
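As a rough illustration, the mapping above can be checked in Python 3, where the built-in `ord` and `chr` convert between a character and its code point. The helper name `code_point_label` is made up for this sketch and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for one character (hypothetical helper)."""
    # ord() gives the code point as an integer; format it as at least
    # four uppercase hexadecimal digits, which is the usual notation.
    return f"U+{ord(ch):04X}"


# The characters from the list above; the Greek letters are written as
# escapes so the example does not depend on the source file's encoding.
for ch in "ABC\u039B\u039C":
    print(code_point_label(ch), ch)

# chr() goes the other way, from a code point to a character.
print(chr(0x0041))
```

The `:04X` format pads short values to four digits but does not truncate longer ones, so code points above U+FFFF are printed with five or six digits.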

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
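The byte sequences in the table can be reproduced with Python 3's `str.encode`, as a minimal sketch. The helper name `hex_bytes` is an assumption for this example; `"utf-8"` and `"utf-16-be"` are standard Python codec names.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated hex (hypothetical helper)."""
    data = text.encode(encoding)
    return " ".join(f"{byte:02X}" for byte in data)


# Same code points as the table: A, B, C, then two Greek capitals.
for ch in "ABC\u039B\u039C":
    label = f"U+{ord(ch):04X}"
    print(label, hex_bytes(ch, "utf-8"), hex_bytes(ch, "utf-16-be"), sep="\t")

# A code point above U+FFFF needs four bytes in both encodings;
# UTF-16 represents it as a surrogate pair.
print(hex_bytes("\U0001F600", "utf-8"))
print(hex_bytes("\U0001F600", "utf-16-be"))
```

Using `"utf-16-be"` rather than plain `"utf-16"` avoids the byte order mark that Python would otherwise prepend.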

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
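Normalization and case rules matter in practice because the same visible text can be stored as different code point sequences. The sketch below uses Python 3's standard `unicodedata` module and `str.casefold`; the helper name `same_text` is hypothetical, and NFC is only one of several normalization forms that could be chosen here.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(composed == decomposed)           # plain comparison sees different sequences
print(same_text(composed, decomposed))  # normalized comparison treats them as equal

# Case folding follows the Unicode case rules; for example the German
# sharp s folds to "ss", so a caseless comparison can change string length.
print("Stra\u00dfe".casefold() == "strasse")
```

Locale-sensitive sorting (the Unicode Collation Algorithm) is not covered by `unicodedata` and would need additional tooling.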

For more general information, see the Unicode article on Wikipedia.

24916 questions
13 votes, 2 answers

Remove or match a Unicode Zero Width Space PHP

I have a text in Burmese language, UTF-8. I am using PHP to work with the text. At some point along the way, some ZWSPs have crept in and I would like to remove them. I have tried two different ways of removing the characters, and neither seems…
Jimmy Long
13 votes, 3 answers

Unicode vs Multi-byte

I'm really confused by this unicode vs multi-byte thing. Say I'm compiling my program in Unicode (but ultimately, I want a solution that is independent of the character set used). 1) Will all 'char' be interpreted as wide characters? 2) If I have a…
Rayne
13 votes, 1 answer

Display width of unicode strings in Python

How can I determine the display width of a Unicode string in Python 3.x, and is there a way to use that information to align those strings with str.format()? Motivating example: Printing a table of strings to the console. Some of the strings contain…
Christian Aichinger
13 votes, 2 answers

Unicode character usage statistics

I am looking for some statistical data on the usage of Unicode characters in textual documents (with any markup). Googling brought no results. Background: I am currently developing a finite state machine-based text processing tool. Statistical data…
lexicore
13 votes, 9 answers

replace emoji unicode symbol using regexp in javascript

As you all know emoji symbols are coded up to 3 or 4 bytes, so it may occupy 2 symbols in my string. For example 'wew'.length = 7 I want to find those symbols in my text and replace them to the value that is dependent from its code. Reading SO, I…
Fedor Skrynnikov
13 votes, 7 answers

Ruby 1.9 doesn't support Unicode normalization yet

I'm trying to port over some of my old rails apps to Ruby 1.9 and I keep getting warnings about how "Ruby 1.9 doesn't support Unicode normalization yet." I've tracked it down to this function, but I'm getting about 20 warning messages per…
go minimal
13 votes, 4 answers

An equivalent to string.ascii_letters for unicode strings in python 2.x?

In the "string" module of the standard library, string.ascii_letters ## Same as string.ascii_lowercase + string.ascii_uppercase is 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ' Is there a similar constant which would include everything…
emm
13 votes, 1 answer

Select unicode character u2028 in mysql 5.1

I am trying to select the Unicode character /u2028 in MySQL 5.1. MySQL 5.1 does support utf8 and ucs2. In newer versions of MySQL I could select the char just by using utf16 or utf32 collation: SELECT char(0x2028 using utf16); SELECT char(0x00002028…
jelhan
13 votes, 1 answer

Pass a list of string from Django to Javascript

My Django objects have an attribute "City". I'm trying to get a list of cities and catch it in the template with jQuery (to use in a chart on the X axis). My problem is that I can't get rid of the unicode and quote for a list. (I manage to do it for…
xavier carbonel
13 votes, 4 answers

HTML unicode arrow works on Safari desktop, but not Safari for iOS

I'm using the ❯ arrow on a page, and it renders properly on Chrome, Firefox and Safari on OS X, however in Safari on iOS (iPhone), the arrows render as empty boxes (you know, the "unable to render" box). Any ideas on why this is happening and what I…
james.spinella
13 votes, 2 answers

How to input Unicode character in Rails console?

While using the Rails console, when I input ä, \U+FFC3\U+FFA4 appears. Of course I can input Unicode characters outside of Rails. I'm using Ruby 2.0.0p247 and Rails 4.0.0 on Mac OS X 10.7.5. How can I input Unicode characters in the Rails console?
ironsand
13 votes, 3 answers

Javascript: Non-unicode char code to unicode character?

I'm having a character code issue with a barcode scanner used to input characters to a web interface. If a barcode has a symbol such as - (a dash/hyphen/minus) it gives me character code 189 which is correct in many character sets. Indeed, if I have…
13 votes, 4 answers

How to find and count emoticons in a string using python?

This topic has been addressed for text based emoticons at link1, link2, link3. However, I would like to do something slightly different than matching simple emoticons. I'm sorting through tweets that contain the emoticons' icons. The following…
blehman
13 votes, 2 answers

How to display unicode in SVG?

Some information is stored in SVG format in the database. If the data contains text, it is displayed as Unicode. The SVG files need to be displayed correctly in the browser.
adelak
13 votes, 5 answers

Unicode filenames on Windows with Python & subprocess.Popen()

Why does the following occur: >>> u'\u0308'.encode('mbcs') #UMLAUT '\xa8' >>> u'\u041A'.encode('mbcs') #CYRILLIC CAPITAL LETTER KA '?' >>> I have a Python application accepting filenames from the operating system. It works for some…
Norman