Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It is intended to support every character needed for written text in all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
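As a small illustration of the mapping above, Python's built-in `ord` and `chr` convert between a character and its code point. This is only a sketch; the helper name `code_point_label` is hypothetical and not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in the conventional U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# Print the same characters that appear in the list above.
for ch in "ABCΛΜ":
    print(code_point_label(ch), ch)

# chr goes the other way, from a code point to a character.
assert chr(0x039B) == "Λ"
```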

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common forms are UTF-8 (which encodes each code point as a sequence of one, two, three or four bytes) and UTF-16 (which encodes each code point as one or two 16-bit code units, that is, two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
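The byte values in the table can be reproduced with Python's `str.encode`. The sketch below uses the standard codec names `utf-8` and `utf-16-be`; the helper `hex_bytes` is a hypothetical name introduced only for this example.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


for ch in "ABCΛΜ":
    # Columns: code point, UTF-8 bytes, UTF-16 (big-endian) bytes
    print(f"U+{ord(ch):04X}", hex_bytes(ch, "utf-8"), hex_bytes(ch, "utf-16-be"), sep="\t")

# A code point above U+FFFF needs four bytes in both encodings;
# in UTF-16 those four bytes form a surrogate pair.
assert hex_bytes("\U0001F600", "utf-8") == "F0 9F 98 80"
assert hex_bytes("\U0001F600", "utf-16-be") == "D8 3D DE 00"
```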

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
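Two of these operations, normalization and case mapping, can be tried with Python's standard library. This is a minimal sketch rather than a full treatment: it does not cover sorting, and the helper name `same_text` is hypothetical.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "\u00e9"     # é as a single code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# The two strings look identical but differ code point by code point.
assert composed != decomposed
assert same_text(composed, decomposed)

# Case mappings are not always one character to one character.
assert "ß".upper() == "SS"
assert "Straße".casefold() == "strasse"
```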

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions
12
votes
8 answers

Inheriting and overriding functions of a std::string?

Since std::string is actually a typedef of a templated class, how can I override it? I want to make a UTF-8 std::string that will return the correct length, among other things.
jmasterx
12
votes
1 answer

wicked_pdf shows unknown character on unicode pdf conversion (ruby)

I'm trying to create a PDF from an HTML page using the wicked_pdf (version 1.1) and wkhtmltopdf-binary gems. My HTML page contains a calendar emoji that displays well in the browser whatever font I use
rico1892
12
votes
5 answers

How can I display a tux character in a shell script?

I realize this is very much of a long shot, but... In shell scripts on Macs I can display an Apple character. Is there any way to display a Tux character (or anything else associated with Linux) on Linux systems? The simplest solution would be if…
iconoclast
12
votes
2 answers

Python: Find equivalent surrogate pair from non-BMP unicode char

The answer presented here: How to work with surrogate pairs in Python? tells you how to convert a surrogate pair, such as '\ud83d\ude4f' into a single non-BMP unicode character (the answer being "\ud83d\ude4f".encode('utf-16',…
hilssu
12
votes
3 answers

Why were the code points in the range of U+D800 to U+DFFF removed from the Unicode character set?

I am learning about UTF-16 encoding, and I have read that if you want to represent code points in the range of U+10000 to U+10FFFF, then you have to use surrogate pairs, which are in the range of U+D800 to U+DFFF. So let's say I want to encode the…
paul
12
votes
4 answers

How to make the Java.awt.Robot type unicode characters? (Is it possible?)

We have a user-provided string that may contain Unicode characters, and we want the robot to type that string. How do you convert a string into keyCodes that the robot will use? How do you do it so it is also Java version independent (1.3 ->…
Greg Domjan
12
votes
5 answers

Handling a Unicode String in Delphi Versions <= 2007

Background: This question relates to versions of Delphi below 2009 (i.e. without Unicode support built in). I have a specification that requires me to transmit a Unicode-encoded string over a TCP connection, but I do not have Delphi 2009. Question Is…
jamiei
12
votes
5 answers

Python: any way to perform this "hybrid" split() on multi-lingual (e.g. Chinese & English) strings?

I have multilingual strings that mix languages that use whitespace as a word separator (English, French, etc.) with languages that don't (Chinese, Japanese, Korean). Given such a string, I want to separate the English/French/etc part…
Continuation
12
votes
4 answers

Convert unicode json to normal json in python

I got the following JSON: {u'a': u'aValue', u'b': u'bValue', u'c': u'cValue'} by doing request.json in my Python code. Now, I want to convert the unicode JSON to normal JSON, something which should look like this: {"a": "aValue", "b": "bValue", "c":…
Sanjiban Bairagya
12
votes
5 answers

How to Show Eastern Letter(Chinese Character) on SQL Server/SQL Reporting Services?

I need to insert Chinese characters into my database, but they always show as ????. Example: insert this record: 微波室外单元-Apple. It then becomes ???. Result: ??????-Apple. I really need help; thanks in advance. I am using MSSQL Server 2008.
Crimsonland
12
votes
3 answers

In UTF-16, UTF-16BE, UTF-16LE, is the endian of UTF-16 the computer's endianness?

UTF-16 is a two-byte character encoding. Exchanging the two bytes' addresses will produce UTF-16BE and UTF-16LE. But I find the name UTF-16 encoding exists in the Ubuntu gedit text editor, as well as UTF-16BE and UTF-16LE. With a C test program I…
hao.zhou
12
votes
2 answers

Where is the character encoding of a text file stored in Linux?

I know the short answer should be "nowhere", however there's something that doesn't quite add up in the following test 2. Test 1. In Gedit, I create a new file containing only the string "aàbï", I choose "Save As" and there's a selector for choosing…
matteo
12
votes
0 answers

How can I open a file that has a Chinese Filename in C?

For some reason, I need to write files that have Chinese characters in their filenames (in C). For example: #include #include FILE *fout; int main(){ fout = fopen("乔布斯.txt", "w"); fprintf(fout, "Jobs"); //Do…
Jeffchan
12
votes
3 answers

What are the most difficult-to-render Unicode samples?

I'm trying to implement a cross-platform (desktop browsers, iOS, & Android) typography system that allows users to input any Unicode string. What are some strings I should use to stress-test my system and ensure the most nines of users will have a…
Ky -
12
votes
2 answers

Python CSV write to file unreadable in Excel (Chinese characters)

I am trying to perform text analysis on Chinese texts. The program is provided below. I got a result with unreadable characters such as 浜烘皯鏃ユ姤绀捐. If I change the output file from result.csv to result.txt, the characters are correct: 人民日报社论. So…
flyingmouse