Questions tagged [surrogate-pairs]

Unicode characters with code points above 0xFFFF are encoded in UTF-16 as pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, using the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
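
The three steps above can be sketched in Python as follows. This is only a minimal illustration of the arithmetic; the function names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical and do not come from any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range 0x10000..0x10FFFF")
    offset = code_point - 0x10000      # 20-bit number, 0..0x0FFFFF
    high = 0xD800 + (offset >> 10)     # top ten bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)    # low ten bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first code unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second code unit is not a low surrogate")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(hex(high), hex(low))                  # 0xd83d 0xde00
    print(hex(from_surrogate_pair(high, low)))  # 0x1f600
```

The range checks are a design choice of this sketch: they reject code points inside the Basic Multilingual Plane and code units that are not valid surrogates, rather than silently producing garbage.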
111 questions
0
votes
1 answer

Replacing emoji from a string

I am new to VB.NET. I am trying to process a string containing emoji, but I cannot do it. My string looks like this: I want to replace with what I am doing is using string.remove and string.add, but I am getting a surrogate pair error. …
sebastian
  • 1
  • 1
0
votes
1 answer

Should StringIO(HighSurrogate) throw an error in Python 2.7?

When I run this Python 2.7 code (edit: updated the code) import io x = io.StringIO(u'\ud801') CPython runs fine, but IronPython throws the following error: UnicodeEncodeError: Unable to translate Unicode character \uD801 at index 0 to specified…
user541686
  • 205,094
  • 128
  • 528
  • 886
0
votes
0 answers

Get Unicode graphemes as unsplit items with Python 2.7

Any idea whether it is possible with regex (Python 2.7) to get unique characters for Unicode graphemes without splitting them into surrogate pairs? According to this example, it is possible with Python 3.x. See here: >>> import regex >>> s = '‍‍‍' >>> for c in…
0
votes
1 answer

Python re: don't split Unicode characters into surrogate pairs while matching

Does anyone know whether it is possible to stop regex from splitting code points into surrogate pairs while matching? See the following example. How it is now: $ te = u'\U0001f600\U0001f600' $ flags1 = regex.findall(".", te, re.UNICODE) $ flags1 >>>…
Egor Savin
  • 39
  • 7
0
votes
0 answers

How to extract a list of all 18 character entries after specific phrase in string using RegExr?

I managed to extract a list of the text within square brackets within an emoji list I have here: https://regexr.com/3sqk1 But now I need to extract the equivalent decimalSurrogateHtml pairs for each emoji (I know a few of them have 2 pairs but would…
deeve
  • 113
  • 6
0
votes
0 answers

UnicodeEncodeError: 'utf-8' codec can't encode character '\udc43' in position 1: surrogates not allowed

I have a list containing placenames and I want to create another array, initially empty, and then iterate the list of placenames and fill up my empty array with these placenames. For example, my first location is 'CHARTRIDGE' and accessing this…
Mr Moose
  • 3
  • 1
  • 5
0
votes
0 answers

Unicode surrogates and combining characters

I'm thinking about using UTF-16 in an application, but I have some difficulty understanding some key concepts, in particular surrogates and combining characters. As I understand it, surrogates are used in UTF-16 to allow encoding of…
woodtluk
  • 935
  • 8
  • 20
0
votes
0 answers

Trying to use surrogate pairs

I am trying to display a playing card using Unicode in Java/Android Studio. The code point for the card is U+1F0A1, which I understand can't be used directly and must be converted to a surrogate pair. The code I have entered is public String…
Heather
  • 41
  • 2
  • 5
0
votes
0 answers

Surrogate pairs cannot display on form

I'm trying to design a virtual keyboard with characters of any language. Everything goes fine, except for one point: codes beyond 0xFFFF. I'm using surrogate pairs for codes beyond 0xFFFF, like this: Dim codeH As Integer Dim codeL As Integer If thisCode…
C. MARIN
  • 95
  • 9
0
votes
2 answers

How to convert between a Unicode/UCS codepoint and a UTF16 surrogate pair?

How to convert back and forth between a Unicode/UCS codepoint and a UTF16 surrogate pair in C++14 and later? EDIT: Removed mention of UCS-2 surrogates, as there is no such thing. Thanks @remy-lebeau!
jotik
  • 17,044
  • 13
  • 58
  • 123
0
votes
2 answers

C++: How to support surrogate characters in UTF-8

We have an application written with UTF-8 as its base encoding, and it supports the UTF-8 BMP (3 bytes). However, there is a requirement that it support surrogate pairs. I have read somewhere that surrogate characters are not supported in…
0
votes
0 answers

Java XML transformation escapes surrogate code units that represent supplementary characters

I am running a web application in the Tomcat 8.0 servlet container. In a request, I try to transform input data to XML with the code below. The first character of the input data is a Unicode supplementary character, U+16980, represented as the char pair…
0
votes
1 answer

Unicode surrogate pairs

Say I have a surrogate pair. For example: \u306f\u30fc Is there a function I can use to print the character to the screen?
Jamie Redmond
  • 731
  • 2
  • 12
  • 14
0
votes
1 answer

Weka: How can I implement a Surrogate Split in J48 Decision Tree?

Can anybody help me implement an alternative missing-value handling in the J48 algorithm using the Weka API in Java? I am sure that using pre-imputation approaches before training the J48 is easy. But what about using a surrogate split attribute in…
user3770188
0
votes
1 answer

Getting the cursor position in RichEdit when text has surrogates

On Windows, if you have a UTF-16 sequence containing surrogates and you insert that sequence into a RichEdit control, the RichEdit control handles this well, and for each surrogate pair it will show only one character. The difficulty I'm facing is…
Emmanuel Stapf
  • 213
  • 1
  • 7