Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to cover every character needed for written text, spanning all writing systems, technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
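The mapping between characters and code points can be inspected directly in most languages. The following is a minimal Python sketch using the built-in ord() and chr(); the helper names code_point_label and from_code_point are hypothetical and exist only for this illustration.

```python
def code_point_label(ch: str) -> str:
    """Return the 'U+XXXX' label for a single character (hypothetical helper)."""
    # ord() gives the integer code point; format it as at least 4 hex digits.
    return f"U+{ord(ch):04X}"


def from_code_point(value: int) -> str:
    """Return the character for an integer code point (hypothetical helper)."""
    return chr(value)


if __name__ == "__main__":
    # \u039b and \u039c are the Greek capitals lambda and mu from the list above.
    for ch in "ABC\u039b\u039c":
        print(code_point_label(ch), ch)
```

The Greek letters are written as escapes here so the example does not depend on the encoding of the source file.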

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes each code point as a sequence of one, two, three or four bytes) and UTF-16 (which encodes each code point as one or two 16-bit code units, that is, two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
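The rows of the table can be reproduced with a short Python sketch, assuming the standard codec names "utf-8" and "utf-16-be". The helper encoded_hex is a hypothetical name used only for this illustration, and U+1F600 is added as an example of a code point that needs four bytes in both forms.

```python
def encoded_hex(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes as spaced upper-case hex.

    Hypothetical helper for illustration; str.encode() does the actual work.
    """
    return " ".join(f"{byte:02X}" for byte in ch.encode(encoding))


if __name__ == "__main__":
    # A, B, C, Greek capital lambda and mu, plus one code point above U+FFFF.
    for ch in ["A", "B", "C", "\u039b", "\u039c", "\U0001F600"]:
        label = f"U+{ord(ch):04X}"
        print(label, encoded_hex(ch, "utf-8"), "|", encoded_hex(ch, "utf-16-be"))
```

The explicit "utf-16-be" codec is used because plain "utf-16" prepends a byte order mark, which would not match the table.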

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
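As a rough illustration of two of these operations, the Python sketch below uses the standard unicodedata module for normalization and the built-in str methods for case mapping. The helper names same_text and caseless_equal are hypothetical. Sorting (collation) is left out because the Python standard library does not implement the Unicode Collation Algorithm.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings ignoring case via casefold() (hypothetical helper)."""
    return a.casefold() == b.casefold()


if __name__ == "__main__":
    composed = "\u00e9"     # e with acute accent as a single code point
    decomposed = "e\u0301"  # letter e followed by a combining acute accent

    print(composed == decomposed)           # the raw code point sequences differ
    print(same_text(composed, decomposed))  # equal once both are normalized

    # Case mapping can change length: sharp s upper-cases to "SS".
    print("stra\u00dfe".upper())
    print(caseless_equal("STRASSE", "stra\u00dfe"))
```

Normalizing both sides before comparing is one common choice; whether NFC or another form is appropriate depends on the application.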

For more general information, see the Unicode article on Wikipedia.

24916 questions
12 votes, 6 answers

Delphi XE - should I use String or AnsiString?

I finally upgraded to Delphi XE. I have a library of units where I use strings to store plain ANSI characters (chars between A and U). I am 101% sure that I will never ever use UNICODE characters in those places. I want to convert all other…
Gabriel

12 votes, 4 answers

Java regex always fails

I have a Java regex pattern and a sentence I'd like to completely match, but for some sentences it erroneously fails. Why is this? (for simplicity, I won't use my complex regex, but just ".*") System.out.println(Pattern.matches(".*",…
Zom-B

12 votes, 4 answers

Are you fluent in Unicode yet?

Almost 5 years ago Joel Spolsky wrote this article, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)". Like many, I read it carefully, realizing it was high-time I got to…
Ash

12 votes, 4 answers

Read a file with unicode characters

I have an asp.net c# page and am trying to read a file that has the following character ’ and convert it to '. (From slanted apostrophe to apostrophe). FileInfo fileinfo = new FileInfo(FileLocation); string content =…
chris

12 votes, 5 answers

Help me understand why Unicode only works sometimes with Python

Here's a little program: #!/usr/bin/env python # -*- encoding: utf-8 -*- print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥') print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥') On Ubuntu, Gnome terminal, IPython does what I would expect: In [6]: run Unicodetest.py abcd kΩ ☠ °C…
endolith

12 votes, 5 answers

Japanese/chinese email addresses?

I'm making some site which must be fully unicode. Database etc are working, I only have some small logic error. I'm testing my register form with ajax if fields are valid, in email field I check with regular expressions. However if a user has an email…
Writecoder

12 votes, 2 answers

Why isn't there a "Medium Small Black Circle" in Unicode

I know this is maybe off-topic on SO, but I don't know where else to ask. The Unicode blocks Miscellaneous Symbols and Miscellaneous Symbols and Arrows contain these characters: HEAVY LARGE CIRCLE (U+2B55) ⭕ (before emojis it used to look like…
m93a

12 votes, 2 answers

Javascript unicode (greek) regular expressions

I would like to use this regular expression new RegExp("\b"+pat+"\b") in Greek text but the "\b" metacharacter supports only ASCII characters. I tried the XRegExp library but I didn't manage to solve the issue. Any suggestions would be greatly…
kylito

12 votes, 5 answers

Get Unicode characters with charcode values greater hex `FFFF`

Issue The ChrW charcode argument is a Long that identifies a character, but doesn't allow values greater than 65535 (hex value &HFFFF) - see MS Help. For instance Miscellaneous symbols and pictographs can be found at Unicode hex block 1F300-1F5FF.…
T.M.

12 votes, 3 answers

How to Convert a javascript object to utf-8 Blob for download?

I've been trying to find a solution that works but couldn't find one. I have an object in javascript and it has some non-english characters in it. I'm trying the following code to convert the object to a blob for download. When I click to download…
Loves2Develop

12 votes, 5 answers

How to parse unicode strings with minidom?

I'm trying to parse a bunch of xml files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XMLs go well, but for some of them I get the following error when calling…
dariopy

12 votes, 2 answers

Using unicode characters as shape

I'd like to use unicode characters as the shape of plots in ggplot, but for unknown reason they're not rendering. I did find a similar query here, but I can't make the example there work either. Any clues as to why? Note that I don't want to use…
Laserhedvig

12 votes, 2 answers

R write.csv with UTF-16 encoding

I'm having trouble outputting a data.frame using write.csv using UTF-16 character encoding. Background: I am trying to write out a CSV file from a data.frame for use in Excel. Excel Mac 2011 seems to dislike UTF-8 (if I specify UTF-8 during text…
Daniel Dickison

12 votes, 3 answers

Using middle-dot ASCII with proper support?

I'm using the middle dot - · - a lot in my website. The ASCII is ·, which works fine. However, there are still some problems with some users not seeing the symbol. Is there a very close but more widely supported symbol like this, or is there a…
AKor

12 votes, 5 answers

How do you match accented and tilde characters in a perl regular expression (regexp)?

A user enters a set of names with accents and tildes: Renato Núñez, David DeJesús, and Edwin Encarnación My database has anglicized names for these people @names = ('Renato Nunez','David DeJesus','Edwin Encarnacion'); I wish to do a regexp match…
Sean