Questions tagged [surrogate-pairs]

Unicode characters with code points above 0xFFFF are encoded in UTF-16 as pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, using the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
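
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a library API: the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and chosen only for this example.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the Basic Multilingual Plane")
    offset = code_point - 0x10000        # 20-bit number in 0..0x0FFFFF
    high = 0xD800 + (offset >> 10)       # top ten bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # low ten bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))                       # 0xd83d 0xde00
print(hex(from_surrogate_pair(0xD852, 0xDF62)))  # 0x24b62
```

The result can be cross-checked against Python's own encoder: `"\U0001F600".encode("utf-16-be")` yields the same two code units as big-endian bytes.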
111 questions
1
vote
1 answer

How to convert surrogate pairs into hexadecimal, and vice-versa in Python?

How would I convert characters which are surrogate pairs into hexadecimal? I've found that using hex() and ord() works for characters with a single code point, such as emojis like "😀". For example: print(hex(ord("😀"))) # '0x1f600' Similarly, using…
1
vote
0 answers

Can Windows' std::wcout display UTF-16 wchar_t surrogate pairs as a unicode character?

[LIVE] \U0001F34C\U0002008A on cout : \U0001F34C\U0002008A on wcout : wcsnlen : 4 1 : 0xd83c 2 : 0xdf4c 3 : 0xd840 4 : 0xdc8a wcsrtombs to char on cout : Even though UCRT makes MSVC more compliant with…
sandthorn
1
vote
1 answer

What unicode character (emoji) it was?

I have this string in my text file: ├░┬č┬Ź┬ć What is known is that it was an emoji, or at least some surrogate character created by a JavaScript string of length 2 or 4. For some reason it ended up in that form. (It was obtained from MySQL…
ElSajko
1
vote
0 answers

How to convert surrogate pairs read from txt files back to emojis in python 3?

I have a few txt files to read that contain strings such as: "Yes! Sardines in a can distancing! \uD83E\uDD23" The problem is that when I do "Yes! Sardines in a can distancing! \uD83E\uDD23".encode('utf-16','surrogatepass'…
1
vote
1 answer

Splitting an emoji sequence in PowerShell

I have a text box that will be filled with emoji only. No spaces or characters of any kind. I need to split these emoji in order to identify them. This is what I have tried: function emoji_to_unicode(){ foreach ($emoji in $textbox.Text) { …
birojow
1
vote
0 answers

Conversion of UTF16 to UTF32 - Invalid surrogate pair

While converting an array of UTF-16 to UTF-32, if I come across a high surrogate and the next value is neither a high surrogate nor a low surrogate, should we invalidate both values in the UTF-16 array, or should we invalidate just the high…
Mounika
1
vote
0 answers

How to Convert Surrogate Pairs in Servlet Response? How can I read Surrogate Pairs?

I receive emoji as surrogate pairs in the response. I want to convert these surrogate pairs to Unicode so that WebSphere Portal 7 will understand them. I added a filter to modify the response and convert the surrogates to Unicode, but…
1
vote
1 answer

Python Unicode - What Characters Can Be Printed in Windows Console?

Which Unicode characters can be printed in a Windows console from Python? The following code for code in range(1114112): print(chr(code), end=",") gives unimpressive results, including an error: UnicodeEncodeError: 'utf-8' codec can't encode…
Robin Andrews
1
vote
1 answer

Write surrogate pairs to file using Haskell

This is the code I have: import qualified System.IO as IO writeSurrogate :: IO () writeSurrogate = do IO.writeFile "/home/sibi/surrogate.txt" ['\xD800'] Executing the above code gives error: text-tests: /home/sibi/surrogate.txt: commitBuffer:…
Sibi
1
vote
1 answer

What is a Unicode safe replica of String.IndexOf(string input) that can handle Surrogate Pairs?

I am trying to figure out an equivalent to C# string.IndexOf(string) that can handle surrogate pairs in Unicode characters. I am able to get the index when only comparing single characters, like in the code below: public static int…
Ibrennan208
1
vote
1 answer

How to reveal the surrogate pairs in String in perl

I am working on a Perl code base to validate customer input; my goal is to block surrogate characters. My thought is to first encode the customer input as UTF-16 and foreach my $messageChar (@MessageChars) { my $messageCharUTF16 =…
Dengke Liu
1
vote
1 answer

Inserting a surrogate pair into MySQL with an INSERT statement

I'm trying to insert a surrogate pair ('𤭢', \uD852\uDF62, the same as U+24B62 from this example) into MySQL. An INSERT with an unescaped literal, suggested by this answer: INSERT INTO unicode_test (value) VALUES ('𤭢'); -- or INSERT INTO unicode_test…
Bass
1
vote
2 answers

Get last character of string in current modern Javascript, allowing for Astral characters such as Emoji that use surrogate pairs (two code units)

Unicode characters (code points) not in the Basic Multilingual Plane (BMP) may consist of two chars (code units), called a surrogate pair. 'ab' is two code units and two code points. (So two chars and two characters.) 'a' is three code units and two…
hippietrail
1
vote
1 answer

Eclipse IDE processing emojis using surrogate pairs

I am not able to find a clear answer to this. Does the Eclipse IDE support emojis? I have read a lot about surrogate pairs here on Stack Overflow, but I am unable to get a clear answer on this. I am having to read in a text file character by…
Wanda
1
vote
1 answer

How to Convert UTF-16 Surrogate Decimal to UNICODE in C++

I got some string data from a parameter, such as &#55357;&#56876;. These are Unicode's UTF-16 surrogate pairs represented as decimal. How can I convert them to Unicode code points such as "U+1F62C" with the standard library?