Questions tagged [unicode]

Unicode is a standard for the encoding, representation and handling of text. It aims to support every character needed for written text, covering all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
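As a small illustration of the mapping listed above, Python's built-in ord and chr convert between a character and its code point. This is only a sketch; the helper name code_point_label is made up for the example and is not part of any library.

```python
def code_point_label(ch: str) -> str:
    """Format one character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# "\u039B" and "\u039C" are the Greek capitals Lambda and Mu from the list above.
for ch in "ABC\u039B\u039C":
    print(code_point_label(ch), ch)

# chr() goes the other way, from a code point to its character.
assert chr(0x0041) == "A"
```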

Unicode Transformation Formats

Unicode Transformation Formats (UTFs) describe how to encode code points as sequences of bytes. The most common are UTF-8 (which encodes each code point as one, two, three or four bytes) and UTF-16 (which encodes each code point as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
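The byte values in the table can be reproduced with Python's str.encode. The sketch below assumes Python 3.8 or later (for bytes.hex with a separator); hex_bytes is a hypothetical helper name used only here.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return its bytes as space-separated uppercase hex."""
    return text.encode(encoding).hex(" ").upper()


# "utf-16-be" is big-endian UTF-16 without a byte order mark, matching the table.
for ch in "A\u039B":
    print(f"U+{ord(ch):04X}", hex_bytes(ch, "utf-8"), hex_bytes(ch, "utf-16-be"))

# A code point outside the Basic Multilingual Plane (here U+1F600) needs four
# bytes in both forms; in UTF-16 those four bytes are a surrogate pair.
print(hex_bytes("\U0001F600", "utf-8"), "|", hex_bytes("\U0001F600", "utf-16-be"))
```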

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
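Two of these operations, normalization and case folding, can be tried with the Python standard library. This is a minimal sketch, not a locale-aware implementation: unicodedata.normalize and str.casefold apply the default Unicode rules only, and locale-sensitive sorting is not shown.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two strings display the same but compare unequal until normalized.
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Case folding supports caseless comparison; U+00DF (sharp s) folds to "ss".
assert "Stra\u00dfe".casefold() == "STRASSE".casefold()
```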

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

Related Tags

24916 questions
13
votes
5 answers

How to convert unicode accented characters to pure ascii without accents?

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t The problem I'm having is that the original paragraph has all those squiggly lines, and reverse letters, and such, so when I read the…
Wolf
  • 141
  • 1
  • 2
  • 9
13
votes
4 answers

Java - Assign unicode apostrophe to char

I want to assign the value of apostrophe to a char: char a = '\''; However I would like to use the unicode version of apostrophe (\u0027) to keep it consistent with my code: char a = '\u0027'; But doing it this way gives an error saying "unclosed…
priomsrb
  • 2,602
  • 3
  • 26
  • 34
13
votes
5 answers

How can I reverse a string that contains combining characters in Perl?

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters,…
Chas. Owens
  • 64,182
  • 22
  • 135
  • 226
13
votes
3 answers

Python 3: Demystifying encode and decode methods

Let's say I have a string in Python: >>> s = 'python' >>> len(s) 6 Now I encode this string like this: >>> b = s.encode('utf-8') >>> b16 = s.encode('utf-16') >>> b32 = s.encode('utf-32') What I get from above operations is a bytes array -- that…
treecoder
  • 43,129
  • 22
  • 67
  • 91
13
votes
5 answers

wchar_t is unsigned or signed

In this link, unsigned wchar_t is typedef'd as WCHAR, but I can't find this kind of typedef in my SDK winnt.h or the MinGW winnt.h. Is wchar_t signed or unsigned? I am using the Windows API in C.
2vision2
  • 4,933
  • 16
  • 83
  • 164
13
votes
2 answers

How do I escape unicode character 0x1F in xml?

I need to write a text with the unicode character 0x1F in a utf-8 document (it is not an allowed character in xml). Is there a way to escape it, or do I have to discard it?
Filip
  • 153
  • 1
  • 1
  • 4
13
votes
4 answers

If Ascii operators are definable, why not Unicode Symbols?

I'm sure I join many in being glad there's finally a powerful language tied tightly to a mainstream GUI/Database/Communication framework. I haven't been sure where to post this, but here seems the best spot. I need to use Unicode symbol…
Michael Ginn
13
votes
1 answer

What are the limitations of primitive character types in D?

I am currently exploring the specification of the Digital Mars D language, and am having a little trouble understanding the complete nature of the primitive character types. The book Learn to Tango With D is similarly vague on the capabilities and…
Ian Gilham
  • 1,916
  • 3
  • 20
  • 31
13
votes
7 answers

Whitespace gone from PDF extraction, and strange word interpretation

Using the snippet below, I've attempted to extract the text data from this PDF file. import pyPdf def get_text(path): # Load PDF into pyPDF pdf = pyPdf.PdfFileReader(file(path, "rb")) # Iterate pages content = "" for i in…
Louis Thibault
  • 20,240
  • 25
  • 83
  • 152
13
votes
5 answers

UTF-8 file output in R

I'm using R 2.15.0 on Windows 7 64-bit. I would like to output unicode (CJK) text to a file. The following code shows how a Unicode character sent to write on a UTF-8 file connection does not work as (I) expected: rty <-…
Patrick
  • 187
  • 1
  • 2
  • 7
13
votes
7 answers

How can I re-add a unicode byte order marker in linux?

I have a rather large SQL file which starts with the byte order marker of FFFE. I have split this file using the Unicode-aware Linux split tool into 100,000 line chunks. But when passing these back to Windows, it does not like any of the parts other…
Neil Trodden
  • 4,724
  • 6
  • 35
  • 55
13
votes
2 answers

How do I detect if a file is encoded using UTF-8?

Is there a way to recognize whether a text file is UTF-8 in Python? I only need to know whether the file is UTF-8 or not; I don't need to detect other encodings.
Riki137
  • 2,076
  • 2
  • 23
  • 26
12
votes
2 answers

Previewing unicode fonts on Linux

Is there a tool on Linux that would allow me to preview Unicode fonts? Fontforge allows me to see the available glyphs and Unicode ranges, but the display is very crude. Gnome font viewer shows only the Latin range. Ideally the tool would accept a…
Basel Shishani
  • 7,735
  • 6
  • 50
  • 67
12
votes
2 answers

How can I get Mocha's Unicode output to display properly in a Windows console?

When I run Mocha, it tries to show a check mark or an X for a passing or a failing test run, respectively. I've seen great-looking screenshots of Mocha's output. But those screenshots were all taken on Macs or Linux. In a console window on Windows,…
Joe White
  • 94,807
  • 60
  • 220
  • 330
12
votes
3 answers

With C++11, do I still need a non-standard string manipulation library for Unicode text?

I've noticed the length method of std::string returns the length in bytes and the same method in std::u16string returns the number of 2-byte sequences. I've also noticed that when a character or code point is outside of the BMP, length returns 4…
user1237077