Questions tagged [unicode]

Unicode is a standard for the encoding, representation and handling of text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

U+0041 A
U+0042 B
U+0043 C
...
U+039B Λ
U+039C Μ
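As a small illustration of the mapping above, the following Python sketch converts characters to their code points and back with the built-in `ord` and `chr`. The helper name `code_point_label` is made up for this example.

```python
def code_point_label(ch: str) -> str:
    """Format a single character as a U+XXXX label (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


# "\u039B" is Λ and "\u039C" is Μ, the Greek capitals from the list above.
for ch in ["A", "B", "C", "\u039B", "\u039C"]:
    print(code_point_label(ch), ch)

# chr() goes the other way: from a code point number to the character.
print(chr(0x0041))
```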

Unicode Transformation Formats

UTFs describe how to encode code points as byte sequences. The most common forms are UTF-8 (which encodes each code point as a sequence of one to four bytes) and UTF-16 (which encodes each code point as one or two 16-bit code units, that is, two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
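The rows of this table can be reproduced with a short Python sketch, assuming Python 3 and its built-in `utf-8` and `utf-16-be` codecs. The helpers `hex_bytes` and `encodings` are hypothetical names, and U+1F600 is an extra example (not in the table) of a code point outside the Basic Multilingual Plane, which takes four bytes in both forms.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


def encodings(ch: str) -> tuple:
    """Return the (UTF-8, UTF-16 big-endian) byte strings for one character."""
    return hex_bytes(ch.encode("utf-8")), hex_bytes(ch.encode("utf-16-be"))


# One ASCII letter, one Greek letter, and one code point beyond U+FFFF.
for ch in ["A", "\u039B", "\U0001F600"]:
    utf8, utf16 = encodings(ch)
    print(f"U+{ord(ch):04X}  {utf8:<12} {utf16}")
```

The `utf-16-be` codec is used rather than plain `utf-16` because the latter prepends a byte order mark, which would not match the table.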

UTF FAQ, UTF-16 FAQ, UTF-8 FAQ

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
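To give a rough idea of what normalization and capitalization rules mean in practice, here is a minimal Python sketch using the standard `unicodedata` module. The sample strings are chosen for illustration only; locale-sensitive sorting is not covered here.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT, twice: renders as résumé.
decomposed = "re\u0301sume\u0301"

# NFC composes each base letter + combining mark into one code point.
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))

# NFD reverses the composition.
print(unicodedata.normalize("NFD", composed) == decomposed)

# Case mapping is not always one-to-one: U+00DF (ß) uppercases to "SS".
print("stra\u00dfe".upper())
```

Comparing strings only after normalizing both to the same form avoids treating the composed and decomposed spellings as different text.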

Latest Version of the Standard

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions

5 answers

How to convert unicode accented characters to pure ascii without accents?

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t The problem I'm having is that the original paragraph has all those squiggly lines, and reverse letters, and such, so when I read the…

python unicode wget unicode-normalization

asked Jan 02 '13 at 07:28

Wolf

4 answers

Java - Assign unicode apostrophe to char

I want to assign the value of an apostrophe to a char: char a = '\''; However, I would like to use the Unicode version of the apostrophe (\u0027) to keep it consistent with my code: char a = '\u0027'; But doing it this way gives an error saying "unclosed…

java unicode

asked Dec 03 '12 at 23:00

priomsrb

5 answers

How can I reverse a string that contains combining characters in Perl?

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters,…

perl unicode string reverse

asked Aug 28 '09 at 14:47

Chas. Owens

3 answers

Python 3: Demystifying encode and decode methods

Let's say I have a string in Python: >>> s = 'python' >>> len(s) 6 Now I encode this string like this: >>> b = s.encode('utf-8') >>> b16 = s.encode('utf-16') >>> b32 = s.encode('utf-32') What I get from above operations is a bytes array -- that…

python unicode encoding python-3.x

asked Nov 20 '12 at 08:57

treecoder

5 answers

wchar_t is unsigned or signed

In this link, unsigned wchar_t is typedef'd as WCHAR. But I can't find this kind of typedef in my SDK's winnt.h or in MinGW's winnt.h. Is wchar_t signed or unsigned? I am using the Windows API in C.

c winapi unicode wchar-t

asked Aug 14 '12 at 13:29

2vision2

2 answers

How do I escape unicode character 0x1F in xml?

I need to write a text with the unicode character 0x1F in a utf-8 document (it is not an allowed character in xml). Is there a way to escape it, or do I have to discard it?

xml unicode

asked Jul 23 '09 at 07:47

Filip

4 answers

If Ascii operators are definable, why not Unicode Symbols?

I'm sure I join many in being glad there's finally a powerful language tied tightly to a mainstream GUI/Database/Communication framework. I haven't been sure where to post this, but here seems the best spot. I need to use Unicode symbol…

unicode f# operators symbols localization

asked Jul 21 '09 at 09:07

Michael Ginn

1 answer

What are the limitations of primitive character types in D?

I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and…

unicode utf-8 d primitive-types utf

asked Jul 12 '09 at 17:33

Ian Gilham

7 answers

Whitespace gone from PDF extraction, and strange word interpretation

Using the snippet below, I've attempted to extract the text data from this PDF file. import pyPdf def get_text(path): # Load PDF into pyPDF pdf = pyPdf.PdfFileReader(file(path, "rb")) # Iterate pages content = "" for i in…

python pdf unicode pypdf

asked Jun 18 '12 at 17:16

Louis Thibault

5 answers

UTF-8 file output in R

I'm using R 2.15.0 on Windows 7 64-bit. I would like to output unicode (CJK) text to a file. The following code shows how a Unicode character sent to write on a UTF-8 file connection does not work as (I) expected: rty <-…

r unicode cjk

asked May 20 '12 at 16:56

Patrick

7 answers

How can I re-add a unicode byte order marker in linux?

I have a rather large SQL file which starts with the byte order marker FFFE. I have split this file into 100,000-line chunks using the Unicode-aware Linux split tool. But when passing these back to Windows, it does not like any of the parts other…

linux bash unicode

asked Jun 25 '09 at 15:31

Neil Trodden

2 answers

How do I detect if a file is encoded using UTF-8?

Is there a way to recognize whether a text file is UTF-8 in Python? I only need to know whether the file is UTF-8 or not. I don't need to detect other encodings.

python unicode utf-8 character-encoding

asked Apr 14 '12 at 18:16

Riki137

2 answers

Previewing unicode fonts on Linux

Is there a tool on Linux that would allow me to preview Unicode fonts? Fontforge allows me to see the available glyphs and Unicode ranges, but the display is very crude. Gnome font viewer shows only the Latin range. Ideally the tool would accept a…

linux unicode fonts utility

asked Mar 23 '12 at 05:05

Basel Shishani

2 answers

How can I get Mocha's Unicode output to display properly in a Windows console?

When I run Mocha, it tries to show a check mark or an X for a passing or a failing test run, respectively. I've seen great-looking screenshots of Mocha's output. But those screenshots were all taken on Macs or Linux. In a console window on Windows,…

powershell unicode mocha.js windows-console

asked Mar 22 '12 at 07:16

Joe White

3 answers

With C++11, do I still need a non-standard string manipulation library for Unicode text?

I've noticed the length method of std::string returns the length in bytes and the same method in std::u16string returns the number of 2-byte sequences. I've also noticed that when a character or code point is outside of the BMP, length returns 4…

c++ unicode c++11

asked Feb 28 '12 at 04:59

user1237077
