Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
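
The mapping between characters and code points can be inspected from most languages. The following is a minimal Python sketch, using only the built-ins ord() and chr(); the helper name code_point_label is made up for this illustration and is not part of any library.

```python
# Characters taken from the list above.
chars = ["A", "B", "C", "Λ", "Μ"]


def code_point_label(ch: str) -> str:
    """Format one character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# ord() maps a character to its code point number.
labels = [code_point_label(ch) for ch in chars]
for label, ch in zip(labels, chars):
    print(label, ch)

# chr() goes the other way: from a code point number to the character.
print(chr(0x0041), chr(0x039B))
```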

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how to encode code points as sequences of bytes. The most common are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
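
The byte values in the table above can be reproduced with Python's built-in codecs. This is a small sketch, not a reference implementation; hex_bytes and encode_row are hypothetical helper names introduced here, while "utf-8" and "utf-16-be" are standard Python codec names.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs, e.g. 'CE 9B'."""
    return " ".join(f"{b:02X}" for b in data)


def encode_row(code_point: int) -> tuple:
    """Return (code point label, UTF-8 bytes, UTF-16 big-endian bytes)."""
    ch = chr(code_point)
    return (
        f"U+{code_point:04X}",
        hex_bytes(ch.encode("utf-8")),
        hex_bytes(ch.encode("utf-16-be")),
    )


for cp in (0x0041, 0x0042, 0x0043, 0x039B, 0x039C):
    print("{:<12}{:<16}{}".format(*encode_row(cp)))
```

Code points above U+FFFF take four bytes in both forms; in UTF-16 they are stored as a surrogate pair, so encode_row(0x1F300) gives F0 9F 8C 80 for UTF-8 and D8 3C DF 00 for UTF-16.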

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
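
Two of these operations, normalization and case folding, are exposed in Python's standard library. The sketch below covers only those two; locale-aware sorting is not included because it generally needs a separate collation library. The variable names are chosen for this illustration.

```python
import unicodedata

composed = "\u00e9"      # "é" as one code point (U+00E9)
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render the same but compare unequal code point by code point.
print(composed == decomposed)

# Normalizing both to the same form (NFC composes, NFD decomposes) makes them comparable.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)

# Case folding is meant for caseless matching and goes further than lower().
folded = "Straße".casefold()
print(folded)
```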

For more general information, see the Unicode article on Wikipedia.

24916 questions
13
votes
5 answers

How do I print the string which __FILE__ expands to correctly?

Consider this program: #include <stdio.h> int main() { printf("%s\n", __FILE__); return 0; } Depending on the name of the file, this program works - or not. The issue I'm facing is that I'd like to print the name of the current file in an…
Frerich Raabe
  • 90,689
  • 19
  • 115
  • 207
13
votes
2 answers

How do I read characters in a string as their UTF-32 decimal values?

I have, for example, this Unicode string, which consists of the Cyclone and the Japanese Castle defined in C# and .NET, which uses UTF-16 for its CLR string encoding: var value = "🌀🏯"; If you check this, you find very quickly that value.Length = 4…
Alexandru
  • 12,264
  • 17
  • 113
  • 208
13
votes
4 answers

How to Output Unicode Strings on the Windows Console

There are already a few questions relating to this problem. I think my question is a bit different because I don't have an actual problem; I'm only asking out of academic interest. I know that Windows' implementation of UTF-16 is sometimes…
Philipp
  • 48,066
  • 12
  • 84
  • 109
13
votes
1 answer

std::u32string conversion to/from std::string and std::u16string

I need to convert between UTF-8, UTF-16 and UTF-32 for different APIs/modules, and since I now have the option to use C++11, I am looking at the new string types. It looks like I can use string, u16string and u32string for UTF-8, UTF-16 and UTF-32. I…
Fire Lancer
  • 29,364
  • 31
  • 116
  • 182
13
votes
6 answers

How can I open files containing accents in Java?

(editing for clarification and adding some code) Hello, We have a requirement to parse data sent from users all over the world. Our Linux systems have a default locale of en_US.UTF-8. However, we often receive files with diacritical marks in their…
Mark Juric
  • 131
  • 1
  • 1
  • 5
13
votes
2 answers

Is it possible to have SQL Server convert collation to UTF-8 / UTF-16

In a project I am working on, my data is stored in SQL Server with the collation Danish_Norwegian_CI_AS. The data is output through FreeTDS and ODBC to Python, which handles the data as UTF-8. Some of the characters, like å, ø and æ, are not being…
Rookie
  • 1,590
  • 5
  • 20
  • 34
13
votes
4 answers

How to deal with Polish Characters while using regex?

I have a street name such as KRZYWOŃ ANIELI. What should my regex be to allow this kind of input? Currently I have a simple one, /^[a-zA-Z ]+$/. Kindly advise.
Rachel
  • 100,387
  • 116
  • 269
  • 365
13
votes
2 answers

How do I send Unicode text from MATLAB into a Word document via the ActiveX interface?

I'm using MATLAB to programmatically create a Microsoft Word document on Windows. In general this solution works fine, but it is having trouble with non-ASCII text. For example, take this code: wordApplication =…
Matthew Simoneau
  • 6,199
  • 6
  • 35
  • 46
13
votes
1 answer

Python removing punctuation from unicode string except apostrophe

I found several topics on this, and I found this solution: sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence) This should remove every punctuation mark except '. The problem is that it also strips everything else from the sentence. Example: >>> sentence="warhol's…
KameeCoding
  • 693
  • 2
  • 9
  • 27
13
votes
6 answers

How to check if the word is Japanese or English using PHP

I want to process English words and Japanese words differently in this function: function process_word($word) { if($word is english) { ///////// }else if($word is japanese) { //////// } } Thank you.
bbnn
  • 3,505
  • 10
  • 50
  • 68
13
votes
2 answers

'str' does not support the buffer interface Python3 from Python2

Hi, I have these two functions that work fine in Py2 but don't work in Py3: def encoding(text, codes): binary = '' f = open('bytes.bin', 'wb') for c in text: binary += codes[c] f.write('%s' % binary) print('Text in binary:',…
Daniel Domingo
  • 159
  • 1
  • 1
  • 3
13
votes
4 answers

How to fix broken utf-8 encoding in Python?

My string is Niệm Bồ Tát (Thiá»n sư Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I see that this site can do it: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx and I started trying it in Python: mystr =…
giaosudau
  • 2,211
  • 6
  • 33
  • 64
13
votes
2 answers

json.dump - UnicodeDecodeError: 'utf8' codec can't decode byte 0xbf in position 0: invalid start byte

I have a dictionary data in which I have stored: key - the ID of an event; value - the name of this event, where the value is a UTF-8 string. Now I want to write this map to a JSON file. I tried this: with open('events_map.json', 'w') as…
Belphegor
  • 4,456
  • 11
  • 34
  • 59
13
votes
5 answers

Possible values for __STDC_ISO_10646__

What are the possible values of the __STDC_ISO_10646__ macro? Wikipedia has a list of the versions of ISO 10646 corresponding to different Unicode versions, but with only the year, not the month, and the macro includes a month value. Edit: Since…
R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
13
votes
2 answers

Doesn't argparse read Unicode from the command line?

Running Python 2.7. When executing: $ python client.py get_emails -a "åäö" I get: usage: client.py get_emails [-h] [-a AREA] [-t {rfc2822,plain}] client.py get_emails: error: argument -a/--area: invalid unicode value:…
Niclas Nilsson
  • 5,691
  • 3
  • 30
  • 43