Questions tagged [surrogate-pairs]

Unicode characters with code points above 0xFFFF are encoded in UTF-16 as pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, using the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
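
The three steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a library API: the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, and the inverse function is an addition that simply undoes the same steps.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit number, 0..0x0FFFFF
    high = 0xD800 + (offset >> 10)     # top ten bits  -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # low ten bits  -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) is stored as the code units D834 DD1E.
high, low = to_surrogate_pair(0x1D11E)
print(f"{high:04X} {low:04X}")
```

The high surrogate always comes first in the code unit sequence, so `from_surrogate_pair` takes its arguments in that order.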
111 questions
2 votes, 1 answer

How to Convert UTF-16 Surrogate Decimal to UNICODE in Java

I have some string data like � ;� ; These are surrogate pairs in UTF-16 in decimal format. How can I convert them to Unicode code points in Java, so that my client can understand the Unicode decimal HTML entity without the surrogate…
2 votes, 3 answers

How to reverse strings that contain surrogate pairs in Dart?

I was playing with algorithms using Dart and as I actually followed TDD, I realized that my code has some limitations. I was trying to reverse strings as part of an interview problem, but I couldn't get the surrogate pairs correctly reversed. const…
Vince Varga
2 votes, 2 answers

Python unicode indexing shows different character

I have a Unicode string in a "narrow" build of Python 2.7.10 containing a Unicode character. I'm trying to use that Unicode character as a lookup in a dictionary, but when I index the string to get the last Unicode character, it returns a different…
Tim
2 votes, 1 answer

Python 2.7: Strange Unicode behavior

I am experiencing the following behavior in Python 2.7: >>> a1 = u'\U0001f04f' #1 >>> a2 = u'\ud83c\udc4f' #2 >>> a1 == a2 #3 False >>> a1.encode('utf8') == a2.encode('utf8') #4 True >>> a1.encode('utf8').decode('utf8') ==…
FireAphis
2 votes, 1 answer

Convert from '\ud835' format to "" in c# [UWP]

I have a string with some wonky characters (for example) " ". I need to check if a List contains the first item in the string. But if I index it, it always becomes \ud835. After using Char.ConvertFromUtf32(\ud835) and some other attempts, I simply…
Adam Dernis
2 votes, 1 answer

Is String.Replace(string,string) Unicode Safe in regards to Surrogate Pairs?

I am trying to figure out the best way to create a function that is equivalent to String.Replace("oldValue","newValue"); that can handle surrogate pairs. My concern is that if there are surrogate pairs in the string and there is the possibility of a…
Ibrennan208
2 votes, 1 answer

Unifying surrogate pairs in Japanese "dakuten" characters using R

I was trying to match a vector of Japanese strings (originally imported from a comma-separated file) with a list of filenames extracted from a folder under Mac OSX. One element from the vector is a: > a [1] "立ち上げる.mp3" The corresponding element…
Carl H
2 votes, 0 answers

native2ascii doesn't honour surrogate pairs

For some reason, native2ascii ignores surrogate pairs when re-encoding escaped characters (\u????) back to Unicode: $ echo '\ud834\udd1e' | native2ascii -reverse -encoding UTF-16BE | hexdump -C 00000000 00 5c 00 75 00 64 00 38 00 33 00 34 00 5c 00…
Bass
2 votes, 1 answer

High surrogate char always goes first (at lower index) within String?

1) Is the order of high and low surrogate chars within a String fixed? Can I rely on it? Experimentally, on Windows the high surrogate goes first in the String (at the lower index in terms of String.charAt(int index)). Is this always so on any platform (Linux, etc.)? …
2 votes, 2 answers

How to generate a random Unicode string including supplementary characters?

I'm working on some code for generating random strings. The resulting string appears to contain invalid char combinations. Specifically, I find high surrogates which are not followed by a low surrogate. Can anyone explain why this is happening? Do I…
Duncan Jones
2 votes, 2 answers

xUnit.net: Why do these 2 equivalent tests have different results?

For some reason, this test utilizing InlineData fails in xUnit: [Theory] [InlineData("\uD800", 1)] public static void HasLength(string s, int length) { Assert.Equal(length, s.Length); } while this, which uses MemberData, passes: public static…
James Ko
2 votes, 2 answers

Output UTF-16? A little stuck

I have some UTF-16 encoded characters in their surrogate pair form. I want to output those surrogate pairs as characters on the screen. Does anyone know how this is possible?
Jamie Redmond
2 votes, 0 answers

Unity3d: surrogate pair emoji do not appear

I am working on a Unity project and adding a chat module to it. I am facing a problem with emojis: they don't appear. I changed Unity's .NET framework setting to use the Microsoft .NET Framework and then used the code that solves the problem…
Mostafa Khattab
2 votes, 2 answers

Are surrogate pairs the only way to represent code points larger than 2 bytes in UTF-16?

I know that this is probably a stupid question, but I need to be sure on this issue. So I need to know for example if a programming language says that its String type uses UTF-16 encoding, does that mean: it will use 2 bytes for code points in the…
user4344762
1 vote, 1 answer

How to chunk a string containing characters with high code point?

I have this string: "3\u{10FFFF}97". The special character has the code point 1114111. When I chunk the string with "3\u{10FFFF}97".match(/.{2}/g), I get this result: ['3\uDBFF', '\uDFFF9']. But I want this result: ['3\u{10FFFF}', '97']. Thanks
mbourd