Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character required for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
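As a rough illustration of the mapping listed above, the following Python sketch converts between characters and their code points using the built-in `ord` and `chr`. The helper names `code_point_label` and `from_code_point` are made up for this example and are not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


def from_code_point(cp: int) -> str:
    """Return the character assigned to an integer code point."""
    return chr(cp)


if __name__ == "__main__":
    # \u039b and \u039c are the Greek capital letters lambda and mu.
    for ch in "ABC\u039b\u039c":
        print(code_point_label(ch), ch)
```

Going the other way, `from_code_point(0x0041)` should give back `A`.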

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how to encode code points as sequences of bytes. The most common forms are UTF-8, which encodes each code point as one to four bytes, and UTF-16, which encodes each code point as two or four bytes.

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
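The rows above can be reproduced with a short Python sketch that relies on the standard `str.encode` method and the `utf-8` and `utf-16-be` codecs. The helpers `hex_bytes` and `encoding_row` are hypothetical names used only for this example, and U+1F600 is added as an assumed extra sample to show the four-byte forms.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encoding_row(ch: str) -> tuple:
    """Return (code point label, UTF-8 bytes, UTF-16 big-endian bytes)."""
    return (
        f"U+{ord(ch):04X}",
        hex_bytes(ch.encode("utf-8")),
        # "utf-16-be" fixes the byte order and does not emit a byte order mark.
        hex_bytes(ch.encode("utf-16-be")),
    )


if __name__ == "__main__":
    # The last sample lies outside the Basic Multilingual Plane, so UTF-16
    # needs a surrogate pair (four bytes) for it.
    for ch in "ABC\u039b\u039c\U0001F600":
        label, utf8, utf16 = encoding_row(ch)
        print(f"{label:<12}{utf8:<16}{utf16}")
```

Using the plain `utf-16` codec instead would typically prepend a byte order mark, which is why the sketch names the byte order explicitly.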

Specification

The Unicode Consortium also defines standards for collation (sorting), case mapping (capitalization), character normalization and other locale-sensitive character operations.
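To make normalization and case handling a little more concrete, here is a minimal Python sketch built on the standard `unicodedata.normalize` function and `str.casefold`. The function `same_text` is a hypothetical helper, and comparing NFC-normalized, case-folded strings is just one simple approach, not a complete caseless-matching algorithm. Locale-aware sorting is left out because it depends on collation data outside this sketch.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization and case folding."""
    def key(s: str) -> str:
        # NFC composes sequences such as "e" + combining acute accent into
        # a single precomposed character; casefold() removes case distinctions.
        return unicodedata.normalize("NFC", s).casefold()

    return key(a) == key(b)


if __name__ == "__main__":
    decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
    precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
    print(decomposed == precomposed)           # raw comparison of code points
    print(same_text(decomposed, precomposed))  # comparison after normalization
    print(same_text("Stra\u00dfe", "STRASSE"))
```

The two spellings of "é" look identical on screen but differ as code point sequences, which is the kind of mismatch normalization is meant to resolve.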

For more general information, see the Unicode article on Wikipedia.

24916 questions
12
votes
6 answers

Writing utf16 to file in binary mode

I'm trying to write a wstring to file with ofstream in binary mode, but I think I'm doing something wrong. This is what I've tried: ofstream outFile("test.txt", std::ios::out | std::ios::binary); wstring hello = L"hello"; outFile.write((char *)…
Cactuar
  • 395
  • 2
  • 6
  • 14
12
votes
5 answers

PHP function imagettftext() and unicode

I'm using the PHP function imagettftext() to convert text into a GIF image. The text I am converting has Unicode characters including Japanese. Everything works fine on my local machine (Ubuntu 7.10), but on my webhost server, the Japanese…
gerdemb
  • 11,275
  • 17
  • 65
  • 73
12
votes
2 answers

How can I replace UTF-8 errors in Ruby without converting to a different encoding?

In order to convert a string to UTF-8 and replace all encoding errors, you can do: str.encode('utf-8', :invalid=>:replace) The only problem with this is it doesn't work if str is already UTF-8, in which case any errors remain: irb> x =…
Matt
  • 21,026
  • 18
  • 63
  • 115
12
votes
3 answers

strange UnicodeDecodeError on django

Was doing a fresh install of my vagrant box and my dev environment and when trying to run my django project I get the following error. Any ideas whats going on? ---------------------------------------- [21/Sep/2013 23:44:03] code 400, message Bad…
jzkelter
  • 131
  • 1
  • 4
12
votes
1 answer

Set up Notepad++ and NppExec to print unicode characters from python

I have a UTF-8 encoded file cjk.py: print("打印") Unsurprisingly, running python cjk.py yields Traceback (most recent call last): File "cjk.py", line 1, in print('\u6253\u5370') File "C:\Python33\lib\encodings\cp850.py", line 19, in…
Clément
  • 12,299
  • 15
  • 75
  • 115
12
votes
3 answers

UnicodeDecodeError: unexpected end of data

I have a huge text file which I want to open. I'm reading the file in chunks, avoiding memory issues related to reading too much of the file all at once. code snippet: def open_delimited(fileName, args): with open(fileName, args,…
Presen
  • 1,809
  • 4
  • 31
  • 46
12
votes
7 answers

The encoding 'UTF-8' is not supported by the Java runtime

Whenever I start our Apache Felix (OSGi) based application under SUN Java ( build 1.6.0_10-rc2-b32 and other 1.6.x builds) I see the following message output on the console (usually under Ubuntu 8.4): Warning: The encoding 'UTF-8' is not supported…
Mark Derricutt
  • 979
  • 1
  • 11
  • 20
12
votes
1 answer

How to convert a char to its full Unicode name?

I need functions to convert between a character (e.g. 'α') and its full Unicode name (e.g. "GREEK SMALL LETTER ALPHA") in both directions. The solution I came up with is to perform a lookup in the official Unicode Standard available online:…
Oksana Gimmel
  • 937
  • 8
  • 13
12
votes
2 answers

How to query MySQL for fields containing null characters

I have a MySQL table with a text column. Some rows have null characters (0x00) as part of this text column (along with other characters). I am looking for a query that will return all rows containing any null characters for this column, but I…
CJS
  • 1,455
  • 1
  • 13
  • 17
12
votes
2 answers

Ruby's String#gsub, unicode, and non-word characters

As part of a larger series of operations, I'm trying to take tokenized chunks of a larger string and get rid of punctuation, non-word gobbledygook, etc. My initial attempt used String#gsub and the \W regexp character class, like so: my_str =…
Steven Bedrick
  • 663
  • 2
  • 8
  • 16
12
votes
2 answers

Unicode Encoding and decoding issues in QRCode

I am trying to generate a UTF-8 QR code so that I can encode accents and Unicode characters. To test it, I am using several decoding solutions: http://zxing.org/w/decode.jspx - The zxing project also used in…
Natim
  • 17,274
  • 23
  • 92
  • 150
12
votes
2 answers

Allowed characters in CSS 'content' property?

I've read that we must use Unicode values inside the content CSS property i.e. \ followed by the special character's hexadecimal number. But what characters, other than alphanumerics, are actually allowed to be placed as is in the value of content…
its_me
  • 10,998
  • 25
  • 82
  • 130
12
votes
4 answers

Java Can't Open a File with Surrogate Unicode Values in the Filename?

I'm dealing with code that does various IO operations with files, and I want to make it able to deal with international filenames. I'm working on a Mac with Java 1.5, and if a filename contains Unicode characters that require surrogates, the JVM…
Bear
  • 121
  • 1
  • 1
  • 3
12
votes
3 answers

Why does Java use modified UTF-8 instead of UTF-8?

Why does Java use modified UTF-8 rather than standard UTF-8 for object serialization and JNI? One possible explanation is that modified UTF-8 can't have embedded null characters and therefore one can use functions that operate on null-terminated…
vitaut
  • 49,672
  • 25
  • 199
  • 336
12
votes
1 answer

Issue about 65533 � in C# text file reading

I created a sample app to load all special characters when copy-pasting from OpenOffice Writer to Notepad. Double codes differ, and when I try to load this: var lines = File.ReadAllLines("..\\ter34.txt"); this creates the 65533 problem. Issue comes…
Aravind Srinivas
  • 251
  • 3
  • 8
  • 15