Questions tagged [surrogate-pairs]

Unicode characters with code points above 0xFFFF are encoded in UTF-16 as pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, using the following scheme:

1. 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
2. the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
3. the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
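
The scheme above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of the tag wiki: the function names `to_surrogates` and `from_surrogates` are hypothetical, chosen for this example.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points in 0x10000..0x10FFFF use a surrogate pair")
    offset = code_point - 0x10000       # 20-bit number in 0..0xFFFFF
    high = 0xD800 + (offset >> 10)      # top ten bits  -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)     # low ten bits  -> 0xDC00..0xDFFF
    return high, low


def from_surrogates(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogates(0x1F436)  # DOG FACE
    print(hex(high), hex(low))
    print(hex(from_surrogates(high, low)))
```

For U+1F436 the two code units come out as 0xD83D and 0xDC36, which matches the values quoted in the Swift question listed below.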

111 questions

2 answers

Handling Unicode surrogate values in Java strings

Consider the following code: byte aBytes[] = { (byte)0xff,0x01,0,0, (byte)0xd9,(byte)0x65, (byte)0x03,(byte)0x04, (byte)0x05, (byte)0x06, (byte)0x07, (byte)0x17,(byte)0x33, (byte)0x74,…

java unicode surrogate-pairs

asked Jun 08 '09 at 16:45

user49598

1 answer

What are surrogate characters in UTF-8?

I have a strange validation program that validates whether a UTF-8 string is a valid host name (Zend Framework Hostname validator in PHP). It allows IDNs (internationalized domain names). It will compare each subdomain with sets of characters defined…

utf-8 utf surrogate-pairs

asked Jun 23 '18 at 12:27

Gherman

2 answers

How to convert surrogate pair to Unicode scalar in Swift

The following example is taken from the Strings and Characters documentation: The values 55357 (U+D83D in hex) and 56374 (U+DC36 in hex) are the surrogate pair that forms the Unicode scalar U+1F436, which is the DOG FACE character. Is there any way…

ios swift unicode scalar surrogate-pairs

asked Jul 08 '15 at 02:47

Suragch

1 answer

How to do surrogateescape in python2

Python 3 changed its Unicode behaviour to reject surrogate pairs, while Python 2 did not. There's a question here, but it does not supply a solution on how to remove surrogate pairs in Python 2 or how to do surrogate escape. Python 3 example: >>> a =…

python unicode python-2.x surrogate-pairs

asked Oct 29 '13 at 04:06

lxyu

2 answers

How to read non-BMP (astral) Unicode supplementary characters (code points)

The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means that it requires more than 16 bits. Almost all of Java's read functions return only a char or an int that also contains only 16 bits. Which function reads complete Unicode…

java unicode codepoint surrogate-pairs supplementary

asked Jun 28 '13 at 09:14

ceving

1 answer

Emojis to/from codepoints in Javascript

In a hybrid Android/Cordova game that I am creating I let users provide an identifier in the form of an Emoji + an alphanumeric - i.e. 0..9,A..Z,a..z - name. For example ‍️Stackoverflow Server-side the user identifiers are stored with the Emoji and…

javascript emoji utf-16 surrogate-pairs

asked Nov 04 '19 at 10:05

DroidOS

1 answer

Check if UTF-8 string is valid in modern C++

It is known that the C++11 standard library makes it easy to convert a string from UTF-8 encoding to UTF-16. However, the following code successfully converts invalid UTF-8 input (at least under MSVC2010): #include #include…

c++ utf-8 surrogate-pairs

asked Jan 14 '17 at 17:28

stgatilov

2 answers

Surrogate Pair Detection Fails

I'm working on a minor side project in F# which involves porting existing C# code to F# and I've seemingly come across a difference in how regular expressions are handled between the two languages (I'm posting this to hopefully find out I am just…

.net regex unicode f# surrogate-pairs

asked Mar 31 '15 at 02:05

Sven Grosen

2 answers

How to iterate over only the characters in a string I can actually see?

Normally I would just use something like str[i]. But what if str = "☀️"? str[i] fails. for (x of str) console.log(x) also fails. It prints out a total of 4 characters, even though there are clearly only 2 emoji in the string. What's the best way to…

javascript unicode surrogate-pairs astral-plane

asked Apr 22 '16 at 04:40

thedayturns

3 answers

Difference between composite characters and surrogate pairs

In Unicode, what is the difference between composite characters and surrogate pairs? To me they sound like similar things: two characters to represent one character. What differentiates these two concepts?

unicode utf-16 surrogate-pairs

asked Mar 01 '14 at 22:23

Sachin Kainth

1 answer

Checking for illegal surrogates in Python 3 strings

Specifically in Python 3.3 and above, is it sufficient to check for orphan surrogates by using the simple match: re.search(r'[\uD800-\uDFFF]', s) Based on the assumption that all legal surrogates would have been represented as astral code points…

regex python-3.x unicode surrogate-pairs

asked Sep 14 '15 at 11:42

Basel Shishani

2 answers

C#: how to get first character of a string?

We already have a question about getting the first 16-bit char of a string. This includes the question code: MyString.ToCharArray[0] and accepted answer code: MyString[0] I guess there are some uses for that, but when the string contains text we…

c# string unicode utf-16 surrogate-pairs

asked Apr 24 '15 at 04:17

hippietrail

4 answers

How to reverse a string that contains surrogate pairs

I have written this method to reverse a string: public string Reverse(string s) { if(string.IsNullOrEmpty(s)) return s; TextElementEnumerator enumerator = …

c# string reverse utf-16 surrogate-pairs

asked Mar 01 '14 at 13:00

Sachin Kainth

1 answer

Simplest way to extract first Unicode codepoint of an NSString (outside the BMP)?

For historical reasons, Cocoa's Unicode implementation is 16-bit: it handles Unicode characters above 0xFFFF via "surrogate pairs". This means that the following code is not going to work: NSString myString = @""; uint32_t codepoint = [myString…

cocoa nsstring surrogate-pairs

asked Oct 08 '12 at 20:05

Quuxplusone

3 answers

How can I store UTF-16 characters in a Postgres database?

I am trying to store some text (e.g. č) in a Postgres database; however, when retrieving this value, it appears on screen as ?. I'm not sure why it does this. I was under the impression that it was a character that wasn't supported in UTF-8, but was…

.net postgresql encoding utf-16 surrogate-pairs

asked Dec 09 '11 at 16:29

Mr Shoubs
