Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
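As a rough illustration of the mapping above, here is a minimal Python sketch that prints the code point and official character name for the same sample characters. The helper name `describe` is made up for this example; `ord` and `unicodedata.name` are from the Python standard library.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX <char> <NAME>' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {ch} {unicodedata.name(ch)}"


# The sample characters from the list above: A, B, C, then the Greek
# capitals at U+039B and U+039C.
for ch in "ABC\u039b\u039c":
    print(describe(ch))
```

Going the other way, `chr(0x039B)` returns the character for a given code point.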

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes each code point as a sequence of one, two, three or four bytes) and UTF-16 (which encodes each code point as one or two 16-bit code units, that is, two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
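The table above can be reproduced with a short Python sketch. The helper `hex_bytes` is a name invented for this example; `str.encode` with the `"utf-8"` and `"utf-16-be"` codec names is standard Python.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs (hypothetical helper)."""
    return " ".join(f"{b:02X}" for b in data)


# Same sample characters as the table: A, B, C, U+039B, U+039C.
for ch in "ABC\u039b\u039c":
    code_point = f"U+{ord(ch):04X}"
    utf8 = hex_bytes(ch.encode("utf-8"))
    utf16 = hex_bytes(ch.encode("utf-16-be"))  # big-endian, no byte order mark
    print(code_point, utf8, utf16, sep="\t")
```

Using `"utf-16-be"` rather than plain `"utf-16"` keeps the byte order mark out of the output, so the bytes line up with the big-endian column.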

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
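To make two of these operations concrete, the sketch below shows normalization and case folding using only the Python standard library. It is an illustration under the assumption that NFC/NFD and simple case folding are the operations of interest; locale-aware sorting is not shown, since it depends on the platform's locale data or on a third-party library.

```python
import unicodedata

composed = "\u00e7"      # "ç" as a single code point
decomposed = "c\u0327"   # "c" followed by COMBINING CEDILLA

# The two strings render identically but compare unequal until normalized.
print(composed == decomposed)

nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully
print(nfc == composed, nfd == decomposed)

# Case folding is a more aggressive form of lowercasing intended for
# caseless comparison.
folded = "Stra\u00dfe".casefold()
print(folded)
```

Normalizing both sides to the same form before comparing or storing text avoids treating visually identical strings as different.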

For more general information, see the Unicode article on Wikipedia.

24916 questions
12
votes
2 answers

how to decode an ascii string with backslash x \x codes

I am trying to decode from a Brazilian Portuguese text: 'Demais Subfun\xc3\xa7\xc3\xb5es 12' It should be 'Demais Subfunções 12' >> a.decode('unicode_escape') >> a.encode('unicode_escape') >> a.decode('ascii') >> a.encode('ascii') all…
Davoud Taghawi-Nejad
  • 16,142
  • 12
  • 62
  • 82
12
votes
1 answer

How to avoid Zalgo text bleeding all over the place without totally removing it?

Our web service has been hit with some Zalgo text and I'm trying to come up with a good solution for the future. Our policy is to accept all user input and save it in permanent storage (we correctly encode the input for our backend so this part is…
Mikko Rantalainen
  • 14,132
  • 10
  • 74
  • 112
12
votes
4 answers

"Wide character in subroutine entry" - UTF-8 encoded Cyrillic words as sequence of bytes

I am working on an Android word game with a large dictionary - The words (over 700 000) are kept as separate lines in a text file (and then put in an SQLite database). To protect my dictionary, I'd like to encode all words which are longer than 3…
Alexander Farber
  • 21,519
  • 75
  • 241
  • 416
12
votes
5 answers

How to write Russian characters in file?

When I try to output Russian characters in the console, it gives me ??????????????? Does anyone know why? I tried writing to a file, and the same thing happens. For example f=open('tets.txt','w') f.write('some russian text') f.close inside file is -…
Pol
  • 24,517
  • 28
  • 74
  • 95
12
votes
2 answers

Unicode paths with MATLAB

Given the following code that attempts to create 2 folders in the current MATLAB path: %% u_path1 = native2unicode([107, 97, 116, 111, 95, 111, 117, 116, 111, 117], 'UTF-8'); % 'kato_outou' u_path2 = native2unicode([233 129 142, 230 184 161, 229…
user2271770
12
votes
1 answer

Python: Getting rid of \u200b from a string using regular expressions

I have a web scraper that takes forum questions, splits them into individual words and writes it to the text file. The words are stored in a list of tuples. Each tuple contains the word and its frequency. Like so... [(u'move', 3), (u'exploration',…
ceilingfan999
  • 133
  • 1
  • 1
  • 6
12
votes
5 answers

How do I use unicode (UTF-8) characters in Clojure regular expressions?

This is a double question for you amazingly kind Stacked Overflow Wizards out there. How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-roman characters…
ivar
  • 1,484
  • 12
  • 20
12
votes
3 answers

Adding Arial Unicode MS to CKEditor

My web application allows users to write rich text inside CKEditor, then export the result as PDF with the Flying Saucer library. As they need to write Greek characters, I chose to add Arial Unicode MS to the available fonts, by doing the following:…
realUser404
  • 2,111
  • 3
  • 20
  • 38
12
votes
7 answers

List of unicode character names

In Python I can print a unicode character by name (e.g. print(u'\N{snowman}')). Is there a way I can get a list of all valid names?
Miki Tebeka
  • 13,428
  • 4
  • 37
  • 49
12
votes
6 answers

Is there a faster way to clean out control characters in a file?

Previously, I had been cleaning out data using the code snippet below import unicodedata, re, io all_chars = (unichr(i) for i in xrange(0x110000)) control_chars = ''.join(c for c in all_chars if unicodedata.category(c)[0] == 'C') cc_re =…
alvas
  • 115,346
  • 109
  • 446
  • 738
12
votes
4 answers

How to use Unicode (UTF-8) in C++

Possible Duplicate: Unicode in C++ If I remember correctly, the default character and string encoding in C++ is ASCII. Is there a simple way to enable Unicode support?
segfault
  • 5,759
  • 9
  • 45
  • 66
12
votes
1 answer

Inconsistent Unicode Emoji Glyphs/Symbols

I've been trying to make use of the Unicode symbols for astrology in products for both Apple and iOS. I'm getting inconsistent results, as shown here: Most of these are coming out as I like, but for some reason the Taurus symbol is appearing one…
Apollo Grace
  • 361
  • 3
  • 16
12
votes
3 answers

How to use unicode symbols in matplotlib?

import matplotlib.pyplot as pyplot pyplot.figure() pyplot.xlabel(u"\u2736") pyplot.show() Here is the simplest code I can create to show my problem. The axis label symbol is meant to be a six-pointed star but it shows as a box. How do I change it…
Chris H
  • 123
  • 1
  • 1
  • 4
12
votes
4 answers

CSS reference to phone's Emoji font?

I want to use this specific emoji in my web page 🔍 - 🔍 On Android, the browser recognises the Unicode glyph as an Emoji, and displays. On the desktop it renders as a Unicode fallback character - a little square with numbers in. So, using…
Terence Eden
  • 14,034
  • 3
  • 48
  • 89
12
votes
2 answers

What's the best way to embed a Unicode character in a POSIX shell script?

There are several shell-specific ways to include a ‘unicode literal’ in a string. For instance, in Bash, the quoted string-expanding mechanism, $'', allows us to directly embed an invisible character: $'\u2620'. However, if you're trying to write…
ELLIOTTCABLE
  • 17,185
  • 12
  • 62
  • 78