Questions tagged [unicode]

Unicode is a standard for the encoding, representation and handling of text. It aims to support all the characters required for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
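As a rough illustration of the mapping above, the following Python sketch prints the code point and official character name for the same characters. The helper name `describe` is made up for this example; `ord`, `chr` and `unicodedata.name` are standard library calls.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The escapes are the Greek capitals from the list above (U+039B, U+039C).
for ch in "ABC\u039b\u039c":
    print(describe(ch))

# chr() goes the other way: from code point back to character.
print(chr(0x0041))
```

`ord` and `chr` work on code points, not bytes, so the result does not depend on any particular encoding.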

Unicode Transformation Formats

A Unicode Transformation Format (UTF) describes how to encode code points as bytes. The most common forms are UTF-8 (which encodes each code point as a sequence of one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
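The table above can be reproduced with a short Python sketch. It assumes Python 3, where `str.encode` accepts the codec names `utf-8` and `utf-16-be`; the helper `hex_bytes` is invented for this example. U+1F600 is added only to show the four-byte case and is not part of the original table.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)


# Code points from the table, plus one outside the Basic Multilingual Plane.
for cp in (0x0041, 0x0042, 0x0043, 0x039B, 0x039C, 0x1F600):
    ch = chr(cp)
    utf8 = hex_bytes(ch.encode("utf-8"))
    utf16 = hex_bytes(ch.encode("utf-16-be"))
    print(f"U+{cp:04X}  {utf8:<12}  {utf16}")
```

For U+1F600 this gives four bytes in UTF-8 (F0 9F 98 80) and a surrogate pair in UTF-16 (D8 3D DE 00).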

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
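As a minimal sketch of two of these operations, the Python snippet below uses the standard `unicodedata.normalize` and `str.casefold` to compare strings regardless of case and of composed versus decomposed form. The function name `same_text` is an assumption made for this example. It does not cover locale-specific rules (such as Turkish dotless i) or collation.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare ignoring case and composed/decomposed form (simplified sketch)."""
    def key(s: str) -> str:
        # Case-fold first, then normalize to a single canonical form.
        return unicodedata.normalize("NFC", s.casefold())
    return key(a) == key(b)


composed = "\u00e9"      # é as one code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)                                # plain comparison
print(unicodedata.normalize("NFD", composed) == decomposed)  # after decomposition
print(same_text(composed, decomposed))
print(same_text("Stra\u00dfe", "STRASSE"))
```

Case-folding before normalizing is a simplification; the Unicode standard describes stricter caseless matching that normalizes both before and after folding.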

For more general information, see the Unicode article on Wikipedia.

24916 questions
13 votes, 4 answers

freebcp: "Unicode data is odd byte size for column. Should be even byte size"

This file works fine (UTF-8): $ cat ok.txt 291054 Ţawī Rifā This file causes an error (UTF-8): $ cat bad.txt 291054 Ţawī Rifā‘ Here's the message: $ freebcp 'DB.dbo.table' in bad.txt ... -c Starting copy... Msg 20050, Level 4 Attempt to convert…
Neil McGuigan
13 votes, 2 answers

How to convert Unicode Character to Int in Swift

A user asked the following question to one of my answers. I have a unicode character \u{0D85}. How do I get the Int value from it? I was going to refer them to another Stack Overflow Q&A but I couldn't find one. These refer to converting the…
Suragch
13 votes, 2 answers

Decoding if it's not unicode

I want my function to take an argument that could be a unicode object or a UTF-8 encoded string. Inside my function, I want to convert the argument to unicode. I have something like this: def myfunction(text): if not isinstance(text, unicode): …
Manuel Ceron
13 votes, 1 answer

How can I truncate a string to have at most N characters?

The expected approach of String.truncate(usize) fails because it doesn't consider Unicode characters (which is baffling considering Rust treats strings as Unicode). let mut s = "ボルテックス".to_string(); s.truncate(4); thread '' panicked at 'assertion…
Peter Uhnak
13 votes, 1 answer

How to pad and align unicode strings with special characters in python?

Python makes it easy to pad and align ascii strings, like so: >>> print "%20s and stuff" % ("test") test and stuff >>> print "{:>20} and stuff".format("test") test and stuff But how can I properly pad and align…
camomilk
13 votes, 2 answers

Why doesn't Perl v5.22 find all the sentence boundaries?

This is fixed in Perl 5.22.1. I write about it in Perl v5.22 adds fancy Unicode word boundaries. Perl v5.22 added the Unicode assertions from TR #29. I've been playing with the sentence boundary assertion, but it only seems to find the start and…
brian d foy
13 votes, 1 answer

How to do Unicode escaping in YAML multiline string?

Is it possible to use Unicode character escaping (e.g. \u2009) in YAML multiline strings? this_escape_works: "foo\u2009bar" this_escape_doesnt: > foo\u2009bar
Sampo
13 votes, 2 answers

How to decode a unicode string Python

What is the best way to decode an encoded string that looks like: u'u\xf1somestring' ? Background: I have a list that contains random values (strings and integers), I'm trying to convert every item in the list to a string then process each of…
mfalade
13 votes, 1 answer

How to detect when bytes can't be converted to string in Go?

There are invalid byte sequences that can't be converted to Unicode strings. How do I detect that when converting []byte to string in Go?
codefx
13 votes, 6 answers

UTF-8 or UTF-16 or UTF-32 or UCS-2

I am designing a new CMS and want it to fit all my future needs, such as multilingual content, so I was thinking Unicode (UTF-8) is the best solution. But after some searching I found this article…
Pola Edward
13 votes, 4 answers

Emacs, unicode, xterm mouse escape sequences, and wide terminals

Short version: When using emacs' xterm-mouse-mode, Somebody (emacs? bash? xterm?) intercepts xterm's control sequences and replaces them with \0. This is a pain on wide monitors because only the first 223 columns have mouse. What is the culprit,…
Ryan
13 votes, 3 answers

Word wrapping in pango with mixed scripts

I have a text box implementation that uses pango. If I put in a string that starts with a word in a right-to-left script, followed by a space, followed by a word in a left-to-right script, the word wrapping that pango uses gets messed up (using…
default
13 votes, 3 answers

How do I get the "visible" length of a combining Unicode string in Python?

If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen". For example, if I have a string with combining overlines and underlines such as…
orome
13 votes, 3 answers

how to deal with unicode in mako?

I constantly get this error using mako: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 6: ordinal not in range(128) I've told mako I'm using unicode in any possible way: mylookup = TemplateLookup( …
Giorgio Gelardi
13 votes, 4 answers

Parsing command line arguments in a unicode C++ application

How can I parse integers passed to an application as command line arguments if the app is unicode? Unicode apps have a main like this: int _tmain(int argc, _TCHAR* argv[]) argv[?] is a wchar_t*. That means I can't use atoi. How can I convert it to…
David Reis