Questions tagged [surrogate-pairs]

Unicode characters with code points above 0xFFFF are encoded in UTF-16 as pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, using the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
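The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and chosen only for this example.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration; follows the scheme described above.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000          # 20-bit number, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)         # top ten bits  -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)        # low ten bits  -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    # U+1F436 DOG FACE, the example quoted in the Swift question below
    high, low = to_surrogate_pair(0x1F436)
    print(hex(high), hex(low))
    print(hex(from_surrogate_pair(high, low)))
```

For U+1F436 this yields the pair 0xD83D, 0xDC36 (55357 and 56374 in decimal), and combining the pair gives 0x1F436 again.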
111 questions
9
votes
2 answers

Handling Unicode surrogate values in Java strings

Consider the following code: byte aBytes[] = { (byte)0xff,0x01,0,0, (byte)0xd9,(byte)0x65, (byte)0x03,(byte)0x04, (byte)0x05, (byte)0x06, (byte)0x07, (byte)0x17,(byte)0x33, (byte)0x74,…
user49598
9
votes
1 answer

What are surrogate characters in UTF-8?

I have a strange validation program that validates whether a UTF-8 string is a valid host name (Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It will compare each subdomain with sets of characters defined…
Gherman
8
votes
2 answers

How to convert surrogate pair to Unicode scalar in Swift

The following example is taken from the Strings and Characters documentation: The values 55357 (U+D83D in hex) and 56374 (U+DC36 in hex) are the surrogate pairs that form the Unicode scalar U+1F436, which is the DOG FACE character. Is there any way…
Suragch
8
votes
1 answer

How to do surrogateescape in python2

Python 3 changed its Unicode behaviour to reject surrogate pairs, while Python 2 did not. There's a question here, but it does not supply a solution on how to remove surrogate pairs in Python 2 or how to do surrogate escape. Python 3 example: >>> a =…
lxyu
8
votes
2 answers

How to read non-BMP (astral) Unicode supplementary characters (code points)

The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char or an int that also contains only 16 bits. Which function reads complete Unicode…
ceving
7
votes
1 answer

Emojis to/from codepoints in Javascript

In a hybrid Android/Cordova game that I am creating I let users provide an identifier in the form of an Emoji + an alphanumeric - i.e. 0..9,A..Z,a..z - name. For example ‍️Stackoverflow Server-side the user identifiers are stored with the Emoji and…
DroidOS
7
votes
1 answer

Check if UTF-8 string is valid in modern C++

It is known that the C++11 standard library makes it easy to convert a string from UTF-8 encoding to UTF-16. However, the following code successfully converts invalid UTF-8 input (at least under MSVC2010): #include #include…
stgatilov
7
votes
2 answers

Surrogate Pair Detection Fails

I'm working on a minor side project in F# which involves porting existing C# code to F# and I've seemingly come across a difference in how regular expressions are handled between the two languages (I'm posting this to hopefully find out I am just…
Sven Grosen
6
votes
2 answers

How to iterate over only the characters in a string I can actually see?

Normally I would just use something like str[i]. But what if str = "☀️"? str[i] fails. for (x of str) console.log(x) also fails. It prints out a total of 4 characters, even though there are clearly only 2 emoji in the string. What's the best way to…
thedayturns
6
votes
3 answers

Difference between composite characters and surrogate pairs

In Unicode what is the difference between composite characters and surrogate pairs? To me they sound like similar things - two characters to represent one character. What differentiates these two concepts?
Sachin Kainth
5
votes
1 answer

Checking for illegal surrogates in Python 3 strings

Specifically in Python 3.3 and above, is it sufficient to check for orphan surrogates by using the simple match: re.search(r'[\uD800-\uDFFF]', s) Based on the assumption that all legal surrogates would have been represented as astral code points…
Basel Shishani
5
votes
2 answers

C#: how to get first character of a string?

We already have a question about getting the first 16-bit char of a string. This includes the question code: MyString.ToCharArray[0] and accepted answer code: MyString[0] I guess there are some uses for that, but when the string contains text we…
hippietrail
5
votes
4 answers

How to reverse a string that contains surrogate pairs

I have written this method to reverse a string: public string Reverse(string s) { if(string.IsNullOrEmpty(s)) return s; TextElementEnumerator enumerator = …
Sachin Kainth
5
votes
1 answer

Simplest way to extract first Unicode codepoint of an NSString (outside the BMP)?

For historical reasons, Cocoa's Unicode implementation is 16-bit: it handles Unicode characters above 0xFFFF via "surrogate pairs". This means that the following code is not going to work: NSString myString = @""; uint32_t codepoint = [myString…
Quuxplusone
4
votes
3 answers

How can I store UTF-16 characters in a Postgres database?

I am trying to store some text (e.g. č) in a Postgres database; however, when retrieving this value, it appears on screen as ?. I'm not sure why it does this. I was under the impression that it was a character that wasn't supported in UTF-8, but was…
Mr Shoubs