Questions tagged [unicode]

Unicode is a standard for encoding, representing and handling text. It aims to support every character needed for written text across all writing systems, including technical symbols and punctuation.

Unicode

Unicode assigns each character a code point to act as a unique reference:

  • U+0041 A
  • U+0042 B
  • U+0043 C
  • ...
  • U+039B Λ
  • U+039C Μ
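The mapping between characters and code points can be inspected from most languages. The following is a minimal Python 3 sketch, not part of the original tag description; the helper name `code_point_label` is made up for illustration, and the Greek letters are written as escapes so the snippet does not depend on the file's encoding.

```python
# Sketch: look up the code point of a character and format it as U+XXXX.
def code_point_label(ch: str) -> str:
    """Return the conventional U+XXXX label for a single character."""
    return f"U+{ord(ch):04X}"

# "\u039b" is the capital lambda and "\u039c" the capital mu listed above.
labels = [code_point_label(ch) for ch in "ABC\u039b\u039c"]
print(labels)

# chr() goes the other way, from a code point back to the character.
lambda_char = chr(0x039B)
print(lambda_char)
```

`ord` and `chr` are Python built-ins; they work on code points, independently of any byte encoding.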

Unicode Transformation Formats

UTFs describe how to encode code points as byte representations. The most common forms are UTF-8 (which encodes code points as a sequence of one, two, three or four bytes) and UTF-16 (which encodes code points as two or four bytes).

Code Point          UTF-8           UTF-16 (big-endian)
U+0041              41              00 41
U+0042              42              00 42
U+0043              43              00 43
...
U+039B              CE 9B           03 9B
U+039C              CE 9C           03 9C
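The rows of the table can be reproduced with a short Python 3 sketch. The helper `hex_bytes` is a hypothetical name used only here, and the extra code point U+1F600 is an assumption added to show the four-byte case, which the table itself does not cover.

```python
# Sketch: encode single code points as UTF-8 and UTF-16 (big-endian)
# and show the resulting bytes in hexadecimal.
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)

def encodings(ch: str) -> tuple:
    """Return (UTF-8 hex, UTF-16BE hex) for one character."""
    return hex_bytes(ch.encode("utf-8")), hex_bytes(ch.encode("utf-16-be"))

# U+0041 and U+039B come from the table; U+1F600 is an added example
# of a code point outside the Basic Multilingual Plane.
for ch in ["A", "\u039b", "\U0001F600"]:
    utf8, utf16 = encodings(ch)
    print(f"U+{ord(ch):04X}  UTF-8: {utf8}  UTF-16BE: {utf16}")
```

The `utf-16-be` codec is used rather than `utf-16` because the latter prepends a byte order mark, which would not match the table.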

Specification

The Unicode Consortium also defines standards for sorting algorithms, rules for capitalization, character normalization and other locale-sensitive character operations.
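As a rough illustration of normalization and capitalization rules, here is a Python 3 sketch using the standard `unicodedata` module. The sample strings are assumptions chosen for the example; locale-sensitive sorting is not shown, since it typically needs a collation library beyond the standard library.

```python
import unicodedata

# The same visible character, "e" with an acute accent, written two ways.
composed = "\u00e9"      # one precomposed code point
decomposed = "e\u0301"   # letter e followed by a combining acute accent

# The raw strings differ, but they compare equal once both are
# normalized to the same form (NFC composes, NFD decomposes).
raw_equal = composed == decomposed
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed
print(raw_equal, nfc_equal, nfd_equal)

# Case mapping is not always one-to-one: the German sharp s (U+00DF)
# uppercases to two letters and case-folds to "ss".
upper_sharp_s = "\u00df".upper()
folded_sharp_s = "\u00df".casefold()
print(upper_sharp_s, folded_sharp_s)
```

Normalizing both sides before comparing is a common way to avoid mismatches between text from different sources.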

Identifying Characters

For more general information, see the Unicode article on Wikipedia.

24916 questions
13
votes
3 answers

Regex to Match Horizontal White Spaces

I need a regex in Python2 to match only horizontal white spaces not newlines. \s matches all whitespaces including newlines. >>> re.sub(r"\s", "", "line 1.\nline 2\n") 'line1.line2' \h does not work at all. >>> re.sub(r"\h", "", "line 1.\nline…
Memduh
  • 836
  • 8
  • 18
13
votes
4 answers

Convert Array of UnicodeScalar into String in Swift

I have an array of Unicode scalars (type is [UnicodeScalar]) like: let array = [UnicodeScalar("f")!, UnicodeScalar("o")!, UnicodeScalar("o")!] or let array2 = "bar".unicodeScalars How can I efficiently convert these arrays into strings again?…
nacho4d
  • 43,720
  • 45
  • 157
  • 240
13
votes
3 answers

Is it actually possible to store and process individual UTF-8 characters in C? If so, how?

I've written a program in C that breaks words down into syllables, segments and letters. It's working well with ASCII characters but I want to make versions that work for the IPA and Arabic too. I'm having massive problems saving and performing…
sally2000
  • 788
  • 4
  • 14
13
votes
2 answers

Devanagari text rendering improperly in PyGame

We have a small web app that we want to convert into something native. Right now, it's got a lot of moving parts (the backend, the browser etc.) and we'd like to convert it into a single tight application. We decided to use PyGame to do this and…
Noufal Ibrahim
  • 71,383
  • 13
  • 135
  • 169
13
votes
3 answers

Unicode with knitr and Rmarkdown

Is there a set of best practices or documentation for working with Unicode in knitr and Rmarkdown? I can't seem to get any glyphs to show up properly when knitting a document. For example, this works in the console (in Rstudio): > cat("\U2660 …
user2987808
  • 1,387
  • 1
  • 12
  • 28
13
votes
2 answers

Printing a Unicode Symbol in C

I'm trying to print a unicode star character (0x2605) in a linux terminal using C. I've followed the syntax suggested by other answers on the site, but I'm not getting an output: #include #include int main(){ wchar_t star =…
Luke Collins
  • 1,433
  • 3
  • 18
  • 36
13
votes
2 answers

Why is the output of print in python2 and python3 different with the same string?

In python2: $ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C 00000000 08 04 87 18 0a |.....| 00000005 In python3: $ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C 00000000 08 04 c2 87 18 0a …
lzutao
  • 409
  • 5
  • 13
13
votes
4 answers

Why isn't the Byte Order Mark emitted from UTF8Encoding.GetBytes?

The snippet says it all :-) UTF8Encoding enc = new UTF8Encoding(true/*include Byte Order Mark*/); byte[] data = enc.GetBytes("a"); // data has length 1. // I expected the BOM to be included. What's up?
xyz
  • 27,223
  • 29
  • 105
  • 125
13
votes
4 answers

How do I paste non-ASCII characters into vim?

My terminal emulator is configured for Unicode character encoding and my .vimrc contains the line set encoding=utf-8 but when I try pasting the word "café" into vim, it comes out as "café". I can make an "é" in vim by typing Ctrl-vu followed by…
sferik
  • 1,795
  • 2
  • 15
  • 22
13
votes
4 answers

Characters appear as question marks in MySQL

I have a problem saving Unicode characters in MySQL. Initially my database character set was set to latin1 and Unicode strings were saved as question marks. After doing some research I added the following lines to…
yinjia
  • 804
  • 2
  • 10
  • 20
13
votes
6 answers

Delphi WideString and Delphi 2009+

I am writing a class that will save wide strings to a binary file. I'm using Delphi 2005 for this but the app will later be ported to Delphi 2010. I'm feeling very unsure here, can someone confirm that: A Delphi 2005 WideString is exactly the same…
David
  • 467
  • 7
  • 13
13
votes
7 answers

findstr or grep that autodetects character encoding (UTF-16)

I want to do this: findstr /s /c:some-symbol * or the grep equivalent grep -R some-symbol * but I need the utility to autodetect files encoded in UTF-16 (and friends) and search them appropriately. My files even have the byte-ordering mark…
David Martin
  • 181
  • 1
  • 2
  • 7
13
votes
2 answers

Sequence of logical OR in ES6/Unicode regular expression in Chrome ✗ vs Firefox ✓

Consider the following Unicode-heavy regular expression (emoji standing in for non-ASCII and extra-BMP characters): ''.match(/||/ug) Firefox returns [ "", "", "", "", "", "" ] . Chrome 52.0.2743.116 and Node 6.4.0 both return null! It doesn’t seem…
Ahmed Fasih
  • 6,458
  • 7
  • 54
  • 95
13
votes
1 answer

Matching Unicode word boundaries in Python

In order to match the Unicode word boundaries [as defined in the Annex #29] in Python, I have been using the regex package with flags regex.WORD | regex.V1 (regex.UNICODE should be default since the pattern is a Unicode string) in the following…
ewcz
  • 12,819
  • 1
  • 25
  • 47
13
votes
4 answers

Unable to translate bytes [FC] at index 35 from specified code page to Unicode

I'm trying to send an object like this to my REST API (built with ASP.NET Core) { "firstName":"tersü", "lastName":"asda" } And this is how the headers from SoapUI look: Accept-Encoding: gzip,deflate Content-Type:…
DVM
  • 1,229
  • 3
  • 16
  • 22