Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

U+0041 A
U+0042 B
U+0043 C
...
U+039B Λ
U+039C Μ
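
As a small illustration, here is a Python sketch of the mapping between characters and code points. Python is chosen only for the example, and the helper names `to_notation` and `from_code_point` are hypothetical; `ord` and `chr` are Python built-ins.

```python
def to_notation(ch: str) -> str:
    """Return the conventional 'U+XXXX' notation for a single character."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"


def from_code_point(cp: int) -> str:
    """Return the character assigned to the given code point."""
    return chr(cp)


if __name__ == "__main__":
    # The same characters as in the list above (Greek letters written as escapes).
    for ch in "ABC\u039b\u039c":
        print(to_notation(ch), ch)
```

The code point is just a number that identifies the character; it says nothing yet about how that character is stored as bytes.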

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how to encode code points as sequences of bytes. The most common are UTF-8 (one to four bytes per code point) and UTF-16 (two or four bytes per code point).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
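
The rows of the table can be reproduced with a short Python sketch. The helper name `hex_bytes` is hypothetical; `str.encode` and the codec names `"utf-8"` and `"utf-16-be"` are part of the standard library. The last character, U+1F300, is an added example of a code point outside the Basic Multilingual Plane, which takes four bytes in both encodings.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


if __name__ == "__main__":
    # A, B, C, Greek capital lambda and mu, then one supplementary character.
    for ch in "ABC\u039b\u039c\U0001F300":
        print(
            f"U+{ord(ch):04X}",
            hex_bytes(ch, "utf-8"),
            hex_bytes(ch, "utf-16-be"),
            sep="    ",
        )
```

In UTF-16 the supplementary character is stored as a surrogate pair of two 16-bit code units, which is why string lengths measured in UTF-16 code units can exceed the number of characters.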

Further reading: UTF FAQ, UTF-16 FAQ, UTF-8 FAQ

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
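
Two of these operations, normalization and case mapping, can be sketched with Python's standard library. The helper name `same_text` is hypothetical, and the choice of the NFC normalization form is an assumption made for the example. Locale-aware sorting needs a collation library and is not shown.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    composed = "\u00e9"      # e with acute accent as a single code point
    decomposed = "e\u0301"   # 'e' followed by a combining acute accent

    print(composed == decomposed)           # raw comparison of code points
    print(same_text(composed, decomposed))  # comparison after normalization

    # Case mapping is not always one character to one character.
    print("Stra\u00dfe".casefold())
    print("\u00df".upper())
```

Normalizing before comparing or storing text avoids treating visually identical strings as different.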

Latest Version of the Standard

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions

5 answers

How do I print the string which __FILE__ expands to correctly?

Consider this program: #include <stdio.h> int main() { printf("%s\n", __FILE__); return 0; } Depending on the name of the file, this program works - or not. The issue I'm facing is that I'd like to print the name of the current file in an…

c++ c unicode winapi

asked Jul 20 '10 at 14:32

Frerich Raabe

2 answers

How do I read characters in a string as their UTF-32 decimal values?

I have, for example, this Unicode string, which consists of the Cyclone and the Japanese Castle defined in C# and .NET, which uses UTF-16 for its CLR string encoding: var value = "🌀🏯"; If you check this, you find very quickly that value.Length = 4…

c# .net unicode encoding

asked Aug 21 '15 at 13:29

Alexandru

4 answers

How to Output Unicode Strings on the Windows Console

There are already a few questions relating to this problem. I think my question is a bit different because I don't have an actual problem, I'm only asking out of academic interest. I know that Windows's implementation of UTF-16 is sometimes…

windows unicode console

asked Jun 28 '10 at 08:29

Philipp

1 answer

std::u32string conversion to/from std::string and std::u16string

I need to convert between UTF-8, UTF-16 and UTF-32 for different APIs/modules, and since I now have the option to use C++11 I am looking at the new string types. It looks like I can use string, u16string and u32string for UTF-8, UTF-16 and UTF-32. I…

c++ linux windows c++11 unicode

asked Jul 08 '15 at 19:58

Fire Lancer

6 answers

How can I open files containing accents in Java?

(editing for clarification and adding some code) Hello, We have a requirement to parse data sent from users all over the world. Our Linux systems have a default locale of en_US.UTF-8. However, we often receive files with diacritical marks in their…

java unicode character-encoding

asked Jun 18 '10 at 18:58

Mark Juric

2 answers

Is it possible to have SQL Server convert collation to UTF-8 / UTF-16

In a project I am working on, my data is stored in SQL Server with the collation Danish_Norwegian_CI_AS. The data is output through FreeTDS and ODBC to Python, which handles the data as UTF-8. Some of the characters, like å, ø and æ, are not being…

sql-server unicode utf-8 collation pyodbc

asked May 16 '15 at 21:47

Rookie

4 answers

How to deal with Polish Characters while using regex?

I have a street name such as KRZYWOŃ ANIELI. What should my regex be to allow this kind of expression? Currently I have a simple one: /^[a-zA-Z ]+$/ Kindly advise.

php regex unicode

asked Jun 10 '10 at 14:35

Rachel

2 answers

How do I send Unicode text from MATLAB into a Word document via the ActiveX interface?

I'm using MATLAB to programmatically create a Microsoft Word document on Windows. In general this solution works fine, but it is having trouble with non-ASCII text. For example, take this code: wordApplication =…

matlab unicode ms-word activex

asked May 08 '15 at 21:12

Matthew Simoneau

1 answer

Python removing punctuation from unicode string except apostrophe

I found several topics on this, and I found this solution: sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence) This should remove all punctuation except ', but the problem is that it also strips everything else from the sentence. Example: >>> sentence="warhol's…

python regex unicode punctuation

asked Apr 28 '15 at 21:29

KameeCoding

6 answers

How to check if the word is Japanese or English using PHP

I want to process English words and Japanese words differently in this function: function process_word($word) { if($word is english) { ///////// }else if($word is japanese) { //////// } } Thank you.

php unicode multibyte

asked May 18 '10 at 11:54

bbnn

2 answers

'str' does not support the buffer interface Python3 from Python2

Hi, I have these two functions that work fine in Py2 but don't work in Py3: def encoding(text, codes): binary = '' f = open('bytes.bin', 'wb') for c in text: binary += codes[c] f.write('%s' % binary) print('Text in binary:',…

python string python-3.x unicode arrays

asked Nov 15 '14 at 12:02

Daniel Domingo

4 answers

How to fix broken utf-8 encoding in Python?

My string is Niá»‡m Bá»“ TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I see in that site can do that http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx and I start to try by Python mystr =…

python unicode utf-8 character-encoding

asked Oct 21 '14 at 16:17

giaosudau

2 answers

json.dump - UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 0: invalid start byte

I have a dictionary, data, in which I have stored: key - the ID of an event; value - the name of this event, where the value is a UTF-8 string. Now I want to write this map to a JSON file. I tried this: with open('events_map.json', 'w') as…

python json unicode encoding utf-8

asked Aug 04 '14 at 15:34

Belphegor

5 answers

Possible values for __STDC_ISO_10646__

What are the possible values of the __STDC_ISO_10646__ macro? Wikipedia has a list of the versions of ISO 10646 corresponding to different Unicode versions, but with only the year, not the month, and the macro includes a month value. Edit: Since…

c unicode iso

asked Jun 23 '14 at 01:22

R.. GitHub STOP HELPING ICE

2 answers

Doesn't argparse read Unicode from the command line?

Running Python 2.7 When executing: $ python client.py get_emails -a "åäö" I get: usage: client.py get_emails [-h] [-a AREA] [-t {rfc2822,plain}] client.py get_emails: error: argument -a/--area: invalid unicode value:…

python unicode argparse

asked Apr 08 '14 at 20:10

Niclas Nilsson

