Questions tagged [unicode]

Unicode is a standard for the encoding, representation and handling of text. It is intended to support all the characters required for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
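
As a minimal sketch of the character-to-code-point mapping listed above, the Python snippet below uses the built-in `ord()` and `chr()` functions. The helper name `code_point_label` is hypothetical and exists only for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character.

    Hypothetical helper for illustration; ord() gives the code point number.
    """
    return f"U+{ord(ch):04X}"


# The same characters as in the list above (the Greek letters are written
# as escapes so the source file's own encoding does not matter).
for ch in "ABC\u039b\u039c":
    print(code_point_label(ch), ch)

# chr() goes the other way: code point number -> character.
print(chr(0x0041))
```

Note that a code point is just a number; how it is stored as bytes is a separate question handled by an encoding.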

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how to encode code points as byte sequences. The most common forms are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes, that is, one or two 16-bit code units).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
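
The table above can be reproduced with Python's built-in `str.encode()`; this is a sketch, and the helper names `hex_bytes` and `encodings` are hypothetical. The last two lines use U+10428, a code point outside the Basic Multilingual Plane, to show the four-byte case in both formats.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encodings(ch: str) -> tuple[str, str]:
    """Return the (UTF-8, UTF-16 big-endian) byte strings for one character."""
    return hex_bytes(ch.encode("utf-8")), hex_bytes(ch.encode("utf-16-be"))


for ch in "ABC\u039b\u039c":
    utf8, utf16 = encodings(ch)
    print(f"U+{ord(ch):04X}  {utf8:<12}{utf16}")

# A supplementary-plane code point needs four bytes in both formats;
# in UTF-16 it is stored as a surrogate pair.
print(encodings("\U00010428"))  # ('F0 90 90 A8', 'D8 01 DC 28')
```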

Specification

The Unicode Consortium also defines standards for collation (sorting), rules for capitalization, character normalization and other locale-sensitive character operations.
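
To illustrate normalization and case rules, here is a small sketch using Python's standard `unicodedata` module and `str.casefold()`. The helper name `same_text` is hypothetical, and comparing after NFC normalization is only one reasonable choice; other normalization forms may suit other uses.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The two spellings look identical but are different code point sequences.
print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True

# Case folding applies Unicode case rules for caseless comparison.
print("Stra\u00dfe".casefold())         # strasse
```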

For more general information, see the Unicode article on Wikipedia.

24916 questions
12 votes, 3 answers

comfortable way to use unicode characters in a ggplot graph

Is there a good practice for inserting Unicode characters in a ggplot title and also saving it as a PDF? I am struggling with expression, paste and sprintf to get a nice title... So, what works is ggtitle(expression(paste('5', mu, 'g'))). This will print an…
drmariod
12 votes, 2 answers

Can't get Czech characters while generating a PDF

I have a problem when adding characters such as "Č" or "Ć" while generating a PDF. I'm mostly using paragraphs for inserting some static text into my PDF report. Here is some sample code I used: var document = new…
perkes456
12 votes, 1 answer

Evaluate UTF-8 literal escape sequences in a string in Python3

I have a string of the form: s = '\\xe2\\x99\\xac' I would like to convert this to the character ♬ by evaluating the escape sequence. However, everything I've tried either results in an error or prints out garbage. How can I force Python to convert…
Altay_H
12 votes, 1 answer

Beautiful Soup Unicode encode error

I am trying the following code with a particular HTML file from BeautifulSoup import BeautifulSoup import re import codecs import sys f = open('test1.html') html = f.read() soup = BeautifulSoup(html) body = soup.body.contents para =…
Rohit Banga
12 votes, 4 answers

HTML unicode ☰ not detected in mobile web application menu in android chrome browser

I have an issue with my website menu in the Chrome browser on Android: it is not able to show the Unicode character ☰. But if I check my web application on an iPhone or in another Android browser, it renders and works properly. I used this icon in this…
Mohammed Javed
12 votes, 4 answers

opencv imread() on Windows for non-ASCII file names

We have an OpenCV problem of opening (and writing) file paths that contain non-ASCII characters on Windows. Affected functions are: cv::imread(), cv::imwrite(), ... As far as I saw in the OpenCV source code, it uses fopen even on Windows (instead of…
Vyacheslav
12 votes, 1 answer

How to write 3 bytes unicode literal in Java?

I'd like to write unicode literal U+10428 in Java. http://www.marathon-studios.com/unicode/U10428/Deseret_Small_Letter_Long_I I tried with '\u10428' and it doesn't compile.
kawty
12 votes, 7 answers

Why can't I use accented characters next to a word boundary?

I'm trying to make a dynamic regex that matches a person's name. It works without problems on most names, until I ran into accented characters at the end of the name. Example: Some Fancy Namé The regex I've used so far is: /\b(Fancy…
Rexxars
12 votes, 5 answers

Programmatically determine number of strokes in a Chinese character?

Does Unicode store stroke count information about Chinese, Japanese, or other stroke-based characters?
xkdkxdxc
12 votes, 2 answers

How to create SSL certificate with Unicode characters in the Organization name (or other fields)?

I've created a self-signed SSL certificate and have no trouble using it, but the browser (Firefox, Chrome/IE) shows garbled characters in the Organization's name (anything above ASCII has 2 characters). I created the certificate in a Debian running…
vesperto
12 votes, 4 answers

Read/Write file with unicode file name with plain C++/Boost

I want to read / write a file with a unicode file name using boost filesystem, boost locale on Windows (mingw) (should be platform independent at the end). This is my code: #include #define BOOST_NO_CXX11_SCOPED_ENUMS #include…
Mike M
12 votes, 2 answers

Python: solving unicode hell with unidecode

I have been working on ways to flatten text into ascii. So ā -> a and ñ -> n, etc. unidecode has been fantastic for this. # -*- coding: utf-8 -*- from unidecode import unidecode print(unidecode(u"ā, ī, ū, ś, ñ")) print(unidecode(u"Estado de São…
e h
12 votes, 1 answer

Error writing a file with file.write in Python. UnicodeEncodeError

I have never dealt with encoding and decoding strings, so I am quite the newbie on this front. I am receiving a UnicodeEncodeError when I try to write the contents I read from another file to a temporary file using file.write in Python. I get the…
user2643864
12 votes, 3 answers

Unicode problems in C++ but not C

I'm trying to write unicode strings to the screen in C++ on Windows. I changed my console font to Lucida Console and I set the output to CP_UTF8 aka 65001. I run the following code: #include //notice this header file.. #include…
Brandon
12 votes, 3 answers

"surrogateescape" cannot escape certain characters

Regarding reading and writing text files in Python, one of the main Python contributors mentions this regarding the surrogateescape Unicode Error Handler: [surrogateescape] handles decoding errors by squirreling the data away in a little used part…
dotancohen