Questions tagged [surrogate-pairs]

Unicode characters with code points above 0xFFFF are encoded in UTF-16 as pairs of 16-bit code units called **surrogate pairs**.

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 as pairs of 16-bit code units called surrogate pairs, using the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.
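
The three steps above can be sketched in Python as follows. This is a minimal illustration, not a library API: the function names `to_surrogate_pair` and `from_surrogate_pair` are made up for this example, and the code works on plain integers rather than on strings.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit number, 0..0xFFFFF
    high = 0xD800 + (offset >> 10)       # top ten bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # low ten bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F603 is encoded as the pair 0xD83D 0xDE03.
high, low = to_surrogate_pair(0x1F603)
print(f"{high:04X} {low:04X}")
```

The result can be cross-checked against Python's built-in codec: `chr(0x1F603).encode("utf-16-be")` yields the same two code units as four big-endian bytes.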
111 questions
1
vote
1 answer

How does the "Surrogate Pair" concept work in a database?

My question relates to databases (and in particular SQL Server): in the official guide, it is mentioned that when using "NVARCHAR/NCHAR", "2 bytes of storage per character" is used and "if a surrogate pair is needed, a character will require 4 bytes…
LearnByReading
1
vote
1 answer

Determine all ISO 15924 script codes in JavaScript string

I'm looking for an efficient way to take a JavaScript string and return all of the scripts which occur in that string. Full UTF-16 including the "astral" plane / non-BMP characters which require surrogate pairs must be correctly handled. This is…
hippietrail
0
votes
0 answers

Python json5 and json packages handle surrogate pairs inconsistently

#!/opt/homebrew/bin/python3 #-*-coding:utf8-*- #python version 3.11.4 import json5 import json d = json5.loads('{"\\ud83d\\ude03": "ok"}') print(d) # output: {'\ud83d\ude03': 'ok'} d = json.loads('{"\\ud83d\\ude03": "ok"}') print(d) # output:…
Green
0
votes
0 answers

Python MySQL doesn't encode surrogates for query parameters

Tried this with both Python 3.7 and Python 3.8, with mysql-connector-python 8.0.13 and 8.1.0, on MySQL 5.7.42. Collation on the database is set to 'utf8mb4_unicode_520_ci'. Connection from Python is: db = None db = mysql.connector.connect( …
0
votes
0 answers

Why does a keyboard-entered super Unicode string change after being passed to a method?

Why is the string "Hello" entered by Scanner(System.in) printed incorrectly after calling func(str2)? The java code is as follows: package test; import java.util.Scanner; public class TestCodePoint { private static void func(String str) { …
Ken.Zhang
0
votes
3 answers

XSL Transformation, emoji and attributes

I'm encountering an issue with emojis when trying to generate HTML output using an XSL transformation under certain circumstances. For instance, I've tested the following XSL with different transformation engines:
morbac
0
votes
1 answer

Unicode conversion in XML failed

In response to a web service call, I am getting an XML
keku keku
0
votes
0 answers

Building Python 3 command from bash gives UnicodeEncodeError: 'utf-8' codec can't encode characters in position surrogates not allowed

This is a silly example of the problem, I have a bash variable I am building inside the string to run with python -c "$string", but there are some characters as which are breaking my program. Running this: #!/bin/bash set -x var='some …
Evandro Coan
0
votes
1 answer

XML marshalling of surrogate pairs

I have come across strange behavior of the marshaller when surrogate pairs are involved. Why does the JAXB marshaller add an unnecessary (and invalid) XML entity? When I try to marshal the following: \uD83D\uDCB3, i.e. code units 55357 56499, the marshaller…
Giron
0
votes
2 answers

python function to transform data to JSON

Can I check how do we convert the below to a dictionary? code.py message = event['Records'][0]['Sns']['Message'] print(message) # this gives the below and the type is { "created_at":"Sat Jun 26 12:25:21 +0000 2021", …
Adam
0
votes
1 answer

Why do the UTF-16 bytes for an emoji smiley and an emoji flag together look different from a sequence of their individual UTF-16 bytes?

The following is from Visual Studio's C# Interactive Compiler: > BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("😀")) "D8-3D-DE-00" > BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes("🏴")) "D8-3C-DF-F4" >…
0
votes
1 answer

How to print surrogate chars as ints in Java

I have this: char c = "\ud804\udef4".charAt(0); char d = "\ud804\udef4".charAt(1); How can I print c and d as hex strings? I want d804 for c and def4 for d.
Ankur Agarwal
0
votes
0 answers

How to detect if string contains any supplementary characters in PHP?

From what I understand so far, supplementary characters (or "surrogate pairs") are defined in a range from 0xd800 to 0xdbff for the first char, and from 0xdc00 to 0xdfff for the second char. So I'm trying to detect if an arbitrary string contains…
c00000fd
0
votes
1 answer

Converting surrogate pairs to emoji - python3

I found the solution to a similar question in another topic, but unfortunately it's not working for me. Here is my problem: I'm making a dataframe from surrogate-pair Unicode escapes which I'd like to search for in another file (example: "\uD83C\uDFF3",…
maliniaki
0
votes
0 answers

How to encode unicode surrogate pairs, to write to file

I'm creating a compression / encryption program that utilises (nearly) all Unicode characters, and then writes the data to a file. To write to the file, I need to encode the characters to bytes. However, when I do so, it gives this…
user11479704