Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
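
As a small illustration (a Python 3 sketch, not part of the original tag description), the built-in `ord` and `chr` functions convert between a character and its code point. The helper name `code_point_label` is made up for this example.

```python
# Characters from the list above; the Greek letters are written as escapes
# so the example does not depend on the source file's encoding.
chars = ["A", "B", "C", "\u039b", "\u039c"]


def code_point_label(ch):
    """Return the conventional U+XXXX label for a single character."""
    return "U+{:04X}".format(ord(ch))


labels = [code_point_label(ch) for ch in chars]
for ch, label in zip(chars, labels):
    print(label, ch)

# chr() is the inverse of ord(): parse the hex digits and map back.
round_trip = [chr(int(label[2:], 16)) for label in labels]
```

Note that a code point identifies a character independently of how it is later stored as bytes.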

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) define how code points are encoded as bytes. The most common are UTF-8 (one to four bytes per code point) and UTF-16 (two or four bytes per code point).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
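
The rows in the table can be reproduced with `str.encode` in Python 3. This is a minimal sketch; `hex_bytes` is a hypothetical helper, and the emoji U+1F60A is added only to show a code point that needs four bytes in both encodings.

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join("{:02X}".format(b) for b in text.encode(encoding))


# One ASCII letter, one Greek letter, one code point outside the BMP.
for ch in ["A", "\u039b", "\U0001F60A"]:
    print(
        "U+{:04X}".format(ord(ch)),
        "|", hex_bytes(ch, "utf-8"),
        "|", hex_bytes(ch, "utf-16-be"),
    )
```

The `utf-16-be` codec is used instead of plain `utf-16` because the latter prepends a byte order mark, which would not match the table.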

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
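
A rough sketch of two of these operations in Python 3, using only the standard library: normalization via `unicodedata.normalize` and case-insensitive matching via `str.casefold`. The helper `same_text` is an assumption made for this example, not a standard function, and NFKC is only one of several normalization forms that may suit a given application.

```python
import unicodedata

# "e" followed by a combining acute accent vs. the precomposed character.
decomposed = "e\u0301"
composed = "\u00e9"

# MICRO SIGN (U+00B5) vs. GREEK SMALL LETTER MU (U+03BC): they look alike
# but are different code points, so a plain comparison is False.
micro = "\u00b5"
mu = "\u03bc"


def same_text(a, b, form="NFKC"):
    """Compare two strings after normalization and case folding."""
    norm_a = unicodedata.normalize(form, a).casefold()
    norm_b = unicodedata.normalize(form, b).casefold()
    return norm_a == norm_b


print(decomposed == composed, same_text(decomposed, composed))
print(micro == mu, same_text(micro, mu))
print(same_text("Stra\u00dfe", "STRASSE"))
```

Compatibility normalization (NFKC/NFKD) discards some distinctions, so it is better suited to comparison and search than to storing text.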

For more general information, see the Unicode article on Wikipedia.

24916 questions
115
votes
9 answers

Python Unicode Encode Error

I'm reading and parsing an Amazon XML file and while the XML file shows a ’, when I try to print it I get the following error: 'ascii' codec can't encode character u'\u2019' in position 16: ordinal not in range(128) From what I've read online…
Alex B
  • 1,185
  • 2
  • 8
  • 5
114
votes
6 answers

How to resolve TypeError: can only concatenate str (not "int") to str

I decided to make some kind of secret code for testing purposes with Unicode. I've done that by adding numbers to Unicode so it would be kind of secret. I've been getting this error, but I don't know how to solve it. Is there any…
9ae
  • 1,169
  • 2
  • 7
  • 4
114
votes
4 answers

How can I iterate through the unicode codepoints of a Java String?

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset. I'm thinking about trying something like: using String#charAt(int) to get the char at an index testing whether the char is in the…
rampion
  • 87,131
  • 49
  • 199
  • 315
113
votes
8 answers

How to set emoji by unicode in a textview?

Hi I'd like to do the following: ??? unicode = U+1F60A String emoji = getEmojiByUnicode(unicode) String text = "So happy " textview.setText(text + emoji); to get this in my textview: So happy 😊 How can I implement getEmojiByUnicode(unicode)? What…
Gilbert Giesbert
  • 3,338
  • 3
  • 13
  • 10
113
votes
16 answers

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1

I'm having a few issues trying to encode a string to UTF-8. I've tried numerous things, including using string.encode('utf-8') and unicode(string), but I get the error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 1: ordinal…
Markum
  • 3,919
  • 8
  • 26
  • 30
112
votes
13 answers

Java equivalent to JavaScript's encodeURIComponent that produces identical output?

I've been experimenting with various bits of Java code trying to come up with something that will encode a string containing quotes, spaces and "exotic" Unicode characters and produce output that's identical to JavaScript's encodeURIComponent…
John Topley
  • 113,588
  • 46
  • 195
  • 237
112
votes
4 answers

Python str vs unicode types

Working with Python 2.7, I'm wondering what real advantage there is in using the type unicode instead of str, as both of them seem to be able to hold Unicode strings. Is there any special reason apart from being able to set Unicode codes in unicode…
Caumons
  • 9,341
  • 14
  • 68
  • 82
112
votes
11 answers

How do I sort unicode strings alphabetically in Python?

Python sorts by byte value by default, which means é comes after z and other equally funny things. What is the best way to sort alphabetically in Python? Is there a library for this? I couldn't find anything. Preferably sorting should have language…
Lennart Regebro
  • 167,292
  • 41
  • 224
  • 251
111
votes
0 answers

How to compare 'μ' and 'µ' in C#

I ran into a surprising issue. I loaded a text file in my application and I have some logic that compares values containing µ. I realized that even when the texts look the same, the comparison returns false. Console.WriteLine("μ".Equals("µ")); //…
D J
  • 6,908
  • 13
  • 43
  • 75
111
votes
17 answers

Most robust method for showing Icon next to text

There are different ways to show graphics in a page next to text. I need to include a graphic/icon that indicates a new tab will be opened. I know it's possible to do using at least these different methods: Unicode character from default…
Lee Englestone
  • 4,545
  • 13
  • 51
  • 85
111
votes
12 answers

How to make the python interpreter correctly handle non-ASCII characters in string operations?

I have a string that looks like so: 6 918 417 712 The clear cut way to trim this string (as I understand Python) is simply to say the string is in a variable called s, we get: s.replace(' ', '') That should do the trick. But of course it…
adergaard
  • 1,221
  • 2
  • 9
  • 7
110
votes
5 answers

Trouble with UTF-8 characters; what I see is not what I stored

I tried to use UTF-8 and ran into trouble. I have tried so many things; here are the results I have gotten: ???? instead of Asian characters. Even for European text, I got Se?or for Señor. Strange gibberish (Mojibake?) such as SeÃ±or or…
Rick James
  • 135,179
  • 13
  • 127
  • 222
110
votes
10 answers

Get a list of all the encodings Python can encode to

I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over? The reason I'm trying to do this is because a user has some text that is…
Amandasaurus
  • 58,203
  • 71
  • 188
  • 248
110
votes
5 answers

Displaying unicode symbols in HTML

I want to simply display the tick (✔) and cross (✘) symbols in an HTML page but it shows up as either a box or goop âœ” - obviously something to do with the encoding. I have set the meta tag to show utf-8 but obviously I'm missing something.
Peter Craig
  • 7,101
  • 19
  • 59
  • 74
109
votes
9 answers

Unicode Processing in C++

What is the best practice of Unicode processing in C++?
Fortepianissimo
  • 3,317
  • 5
  • 21
  • 15