Questions tagged [surrogate-pairs]

Unicode characters with code above 0xFFFF, are encoded in UTF-16 by pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is characters with code above 0xFFFF, are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:

0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

111 questions

votes

1 answer

How to Convert UTF-16 Surrogate Decimal to UNICODE in Java

I have some string data like � ;� ; These are surrogate pairs in UTF 16 in decimal format. How can I convert them to Unicode Code Points in Java, so that my client can understand the Unicode decimal html entity without the surrogate…

java surrogate-pairs

asked Mar 12 '20 at 15:53

Prashanth Tiramareddi

votes

3 answers

How to reverse strings that contain surrogate pairs in Dart?

I was playing with algorithms using Dart and as I actually followed TDD, I realized that my code has some limitations. I was trying to reverse strings as part of an interview problem, but I couldn't get the surrogate pairs correctly reversed. const…

string flutter unicode dart surrogate-pairs

asked Oct 13 '19 at 03:43

Vince Varga

6,101
6
43
60

votes

2 answers

Python unicode indexing shows different character

I have a Unicode string in a "narrow" build of Python 2.7.10 containing a Unicode character. I'm trying to use that Unicode character as a lookup in a dictionary, but when I index the string to get the last Unicode character, it returns a different…

python python-2.7 unicode ucs2 surrogate-pairs

asked Mar 20 '19 at 17:29

Tim

2,756
1
15
31

votes

1 answer

Python 2.7: Strange Unicode behavior

I am experiencing the following behavior in Python 2.7: >>> a1 = u'\U0001f04f' #1 >>> a2 = u'\ud83c\udc4f' #2 >>> a1 == a2 #3 False >>> a1.encode('utf8') == a2.encode('utf8') #4 True >>> a1.encode('utf8').decode('utf8') ==…

python unicode utf-8 character-encoding surrogate-pairs

asked Nov 04 '18 at 12:22

FireAphis

6,650
8
42
63

votes

1 answer

Convert from '\ud835' format to "" in c# [UWP]

I have a string with some wonky characters (for example) " ". I need to check if a List contains the first item in the string. But if I index it, it always becomes \ud835. After using Char.ConvertFromUtf32(\ud835) and some other attempts, I simply…

c# .net uwp surrogate-pairs

asked Aug 10 '18 at 23:07

Adam Dernis

votes

1 answer

Is String.Replace(string,string) Unicode Safe in regards to Surrogate Pairs?

I am trying to figure out the best way to create a function that is equivalent to String.Replace("oldValue","newValue"); that can handle surrogate pairs. My concern is that if there are surrogate pairs in the string and there is the possibility of a…

c# string unicode replace surrogate-pairs

asked May 04 '18 at 18:06

Ibrennan208

1,345
3
14
31

votes

1 answer

Unifying surrogate pairs in Japanese "dakuten" characters using R

I was trying to match a vector of Japanese strings (originally imported from a comma-separated file) with a list of filenames extracted from a folder under Mac OSX. One element from the vector is a: > a [1] "立ち上げる.mp3" The corresponding element…

r surrogate-pairs cjk

asked Nov 10 '17 at 04:19

Carl H

1,036
2
15
27

votes

0 answers

native2ascii doesn't honour surrogate pairs

For some reason, native2ascii ignores surrogate pairs when re-encoding escaped characters (\u????) back to Unicode: $ echo '\ud834\udd1e' | native2ascii -reverse -encoding UTF-16BE | hexdump -C 00000000 00 5c 00 75 00 64 00 38 00 33 00 34 00 5c 00…

java unicode surrogate-pairs

asked Aug 02 '17 at 09:14

Bass

4,977
2
36
82

votes

1 answer

High surrogate char always goes first (at lower index) within String?

1) Is high and low surrogate char order within String is fixed? Can I rely on it? Experimentally on Windows highSurrogate goes first into String (at lower index in terms of String.charAt(int index)). Is this always so on any Platform (Linux, etc)? …

java string unicode surrogate-pairs

asked May 07 '17 at 14:09

DiligentLearner

votes

2 answers

How to generate a random Unicode string including supplementary characters?

I'm working on some code for generating random strings. The resulting string appears to contain invalid char combinations. Specifically, I find high surrogates which are not followed by a low surrogate. Can anyone explain why this is happening? Do I…

java unicode surrogate-pairs

asked Dec 12 '16 at 21:28

Duncan Jones

67,400
29
193
254

votes

2 answers

xUnit.net: Why do these 2 equivalent tests have different results?

For some reason, this test utilizing InlineData fails in xUnit: [Theory] [InlineData("\uD800", 1)] public static void HasLength(string s, int length) { Assert.Equal(length, s.Length); } while this, which uses MemberData, passes: public static…

c# .net xunit xunit.net surrogate-pairs

asked Mar 19 '16 at 17:22

James Ko

32,215
30
128
239

votes

2 answers

Output UTF-16? A little stuck

I have some UTF-16 encoded characters in their surrogate pair form. I want to output those surrogate pairs as characters on the screen. Does anyone know how this is possible?

php utf-16 surrogate-pairs

asked Aug 17 '10 at 21:11

Jamie Redmond

votes

0 answers

Unity3d surrogate pairs emoji not appears

I am working in a unity project, and I am adding a chat module in it. I am facing a problem with emotions as it doesn't appear. I changed the .Net framework of unity to use microsoft .net framework and then used the code that solve the problem…

c# .net unity-game-engine utf-16 surrogate-pairs

asked Sep 19 '15 at 10:13

Mostafa Khattab

votes

2 answers

Are surrogate pairs the only way to represent code points larger than 2 bytes in UTF-16?

I know that this is probably a stupid question, but I need to be sure on this issue. So I need to know for example if a programming language says that its String type uses UTF-16 encoding, does that mean: it will use 2 bytes for code points in the…

unicode utf-16 codepoint surrogate-pairs

asked Dec 10 '14 at 08:54

user4344762

vote

1 answer

How to chunk a string containing characters with high code point?

I have this string : "397" The special character has the code point : 1114111 When I chunk the string : "397".match(/.{2}/g) I have this result : ['3\uDBFF', '\uDFFF9'] But I want this result : ['3', '97'] Thanks

javascript string surrogate-pairs

asked Apr 14 '23 at 09:43

mbourd

Prev 1 2 3

5 6 7 8 Next