Questions tagged [unicode]

Unicode is a standard for the encoding, representation and handling of text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
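As a minimal sketch of the mapping above, the following Python snippet converts between characters and their code points. The helper name code_point_label is hypothetical and used only for illustration; ord and chr are Python built-ins.

```python
def code_point_label(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation."""
    # ord() gives the integer code point; pad to at least four hex digits.
    return f"U+{ord(ch):04X}"


# The characters from the list above, written as escapes to avoid
# any dependence on the source file's encoding.
for ch in ["A", "B", "C", "\u039b", "\u039c"]:
    print(code_point_label(ch), ch)

# chr() goes the other way, from code point to character.
capital_mu = chr(0x039C)
```

The label is just a conventional way of writing the integer; the character itself carries no information about how it will later be stored as bytes.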

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common forms are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as one or two 16-bit units, that is, two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
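The byte values in the table can be reproduced with a short Python sketch. The helper hex_bytes is a hypothetical name introduced here; it relies on str.encode and bytes.hex with a separator argument, which assumes Python 3.8 or later. The last sample, U+1F600, is not in the table; it is added as an assumption-free illustration of the four-byte case mentioned above.

```python
def hex_bytes(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes as spaced uppercase hex."""
    return ch.encode(encoding).hex(" ").upper()


# "utf-16-be" is big-endian UTF-16 without a byte order mark,
# matching the right-hand column of the table.
for ch in ["A", "\u039b", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", hex_bytes(ch, "utf-8"), "|", hex_bytes(ch, "utf-16-be"))
```

Code points above U+FFFF need four bytes in both encodings: UTF-8 uses a four-byte sequence, and UTF-16 uses a surrogate pair of two 16-bit units.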

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
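To illustrate why normalization and case rules matter, here is a small Python sketch using the standard unicodedata module. The helper same_text is a hypothetical name, and comparing NFC forms is only one possible notion of equality, chosen here as an assumption.

```python
import unicodedata

# The same visible character "é" written two ways.
composed = "\u00e9"      # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)           # raw comparison of code points
print(same_text(composed, decomposed))  # comparison after normalization

# Case mapping is not always one character to one character:
# uppercasing the German sharp s produces two letters.
upper_sharp_s = "\u00df".upper()
```

The raw comparison fails because the two strings contain different code points, even though they render identically; normalizing first makes them compare equal.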

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions
13
votes
3 answers

How to make MySQL aware of multi-byte characters in LIKE and REGEXP?

I have a MySQL table with two columns, both utf8_unicode_ci collated. It contains the following rows. Besides ASCII, the second field also contains Unicode code points like U+02C8 (MODIFIER LETTER VERTICAL LINE) and U+02D0 (MODIFIER LETTER…
Tim
  • 13,904
  • 10
  • 69
  • 101
13
votes
2 answers

How can I audit my Windows application for correct Unicode handling?

I can't use prepackaged Unicode string libraries, such as ICU, because they blow up the size of the binary to an insane degree (it's a 200k program; ICU is 16MB+!). I'm using the builtin wchar_t string type for everything already, but I want to…
Billy ONeal
  • 104,103
  • 58
  • 317
  • 552
13
votes
4 answers

If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings UTF-8 is an 8-bits encoding. So,…
Sergey
  • 11,548
  • 24
  • 76
  • 113
13
votes
5 answers

What's the point of String.normalize()?

While reviewing JavaScript concepts, I found String.normalize(). It does not show up in W3Schools' "JavaScript String Reference", which may be why I missed it before. I found more information about it in…
Tiago Martins Peres
  • 14,289
  • 18
  • 86
  • 145
13
votes
2 answers

Cyrillic alphabet validation

I came across an interesting defect today. I have a deployment of my web application in Russia, and the name value "Наталья" is not returning true as alphaNumeric in the method below. Curious for some input on how people would approach a…
Duncan Krebs
  • 3,366
  • 2
  • 33
  • 53
13
votes
2 answers

How to correctly display unicode characters in VS Code's Integrated Terminal?

As per the title, I can't seem to get the VS Code Integrated Terminal to correctly display Unicode characters. They always show up as question marks (?) in the integrated terminal. I've ensured that the files are saved with encoding UTF-8 which seemed to…
Sheng Ying
  • 131
  • 1
  • 1
  • 5
13
votes
2 answers

Proper Way to Insert Strings to a SQLAlchemy Unicode Column

I have a SQLAlchemy model with a Unicode column. I sometimes insert Unicode values into it (u'Value'), but also sometimes insert ASCII strings. What is the best way to go about this? When I insert ASCII strings with special characters I get this…
Raiders
  • 181
  • 2
  • 7
13
votes
2 answers

What is the difference between "combining characters" and "modifier letters"?

In the Unicode standard, there are diacritical marks, such as U+0302, COMBINING CIRCUMFLEX ACCENT (◌̂), and U+02C6, MODIFIER LETTER CIRCUMFLEX ACCENT (ˆ). I know that combining characters are combined with the previous letter to, say, make a letter…
Greg
  • 224
  • 3
  • 10
13
votes
2 answers

Unicode identifiers (function names) for non-localization purposes advisable?

PHP allows Unicode identifiers for variables, functions, classes and constants anyhow. It was certainly intended for localized applications. Whether it's a good idea to code an API in anything but English is debatable, but it's undisputed that some…
mario
  • 144,265
  • 20
  • 237
  • 291
13
votes
4 answers

A resilient, actually working CSV implementation for non-ascii?

[Update] Appreciate the answers and input all around, but working code would be most welcome. If you can supply code that can read the sample files you are king (or queen). [Update 2] Thanks for the excellent answers and discussion. What I need to…
Parand
  • 102,950
  • 48
  • 151
  • 186
13
votes
1 answer

What is the efficient, standards-compliant mechanism for processing Unicode using C++17?

Short version: If I wanted to write a program that can efficiently perform operations with Unicode characters and can input and output files in UTF-8 or UTF-16 encodings, what is the appropriate way to do this with C++? Long version: C++…
Poeta Kodu
  • 1,120
  • 8
  • 16
13
votes
7 answers

Is there an STL string class that properly handles Unicode?

I know all about std::string and std::wstring but they don't seem to fully pay attention to extended character encoding of UTF-8 and UTF-16 (on Windows at least). There is also no support for UTF-32. So does anyone know of cross-platform drop-in…
Goz
  • 61,365
  • 24
  • 124
  • 204
13
votes
4 answers

Windows Console and Qt Unicode Text

I spent a whole day trying to figure this out with no luck. I looked everywhere but found no working code. OS: Win XP SP2. IDE & framework: C++, Qt Creator 2.0. I am trying to output some Unicode (UTF-8) text to the Windows console but all I see…
user440297
  • 1,181
  • 4
  • 23
  • 33
13
votes
3 answers

python: unicode problem

I am trying to decode a string I took from file: file = open ("./Downloads/lamp-post.csv", 'r') data =…
Oleg Tarasenko
  • 9,324
  • 18
  • 73
  • 102
13
votes
3 answers

Detect if character is simplified or traditional Chinese character

I found this question which gives me the ability to check if a string contains a Chinese character. I'm not sure if the Unicode ranges are correct but they seem to return false for Japanese and Korean and true for Chinese. What it doesn't do is tell…
thenengah
  • 42,557
  • 33
  • 113
  • 157