
Table of contents

  • The context
  • The problem
  • The question

The context

In the context of R, I'm aware that stringi::stri_unescape_unicode() can be used to convert an escaped Unicode code point into its corresponding character.

For example, the Unicode code points for á (LATIN SMALL LETTER A WITH ACUTE) and 好 are U+00E1 and U+597D, respectively. This means that I can insert those characters by executing the following.

library(stringi)

stringi::stri_unescape_unicode("\\u00E1")
stringi::stri_unescape_unicode("\\u597D")
[1] "á"
[1] "好"

I'm also aware that characters in the following ranges are for private use. The following quote was retrieved from this glossary (archive) on https://unicode.org.

Private-Use Code Point. Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. (See definition D49 in Section 3.5, Properties.) These code points are designated in the Unicode Standard for private use.

As you can read in the quote, there are three ranges. The following lists those characters that are the limits of those ranges.

  • First range:  (U+E000)
  • First range:  (U+F8FF)
  • Second range: (U+F0000)
  • Second range: (U+FFFFD)
  • Third range: (U+100000)
  • Third range: (U+10FFFD)
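
The three ranges listed above can also be written down as integer bounds in base R. The following is only a minimal sketch: `pua_ranges` and `is_private_use()` are hypothetical names that I made up for illustration, and they are not part of stringi or any other package.

```r
# Hypothetical helper (not from any package): the three private-use ranges
# quoted from the Unicode glossary, stored as inclusive integer bounds.
pua_ranges <- data.frame(
  start = c(0xE000, 0xF0000, 0x100000),
  end   = c(0xF8FF, 0xFFFFD, 0x10FFFD)
)

# Returns TRUE for each code point that falls inside one of the three ranges.
is_private_use <- function(code_point) {
  vapply(
    code_point,
    function(cp) any(cp >= pua_ranges$start & cp <= pua_ranges$end),
    logical(1)
  )
}

# The six range limits from the list above.
is_private_use(c(0xE000, 0xF8FF, 0xF0000, 0xFFFFD, 0x100000, 0x10FFFD))

# The code points of á and 好 from the earlier example.
is_private_use(c(0x00E1, 0x597D))
```

The first call checks the six limits and the second checks two ordinary characters, so it gives a quick way to confirm that a code point really lies in a private-use range before trying to print it.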

The problem

When I try to print the characters in the list above that belong to the first range (i.e. U+E000 and U+F8FF), there's no problem.

stringi::stri_unescape_unicode("\\ue000")
stringi::stri_unescape_unicode("\\uf8ff")
[1] ""
[1] ""

However, when I try to print the characters in the list above that belong to the second range (i.e. U+F0000 and U+FFFFD), R doesn't return those characters.

stringi::stri_unescape_unicode("\\uf0000")
stringi::stri_unescape_unicode("\\uffffd")
[1] "0"
[1] "\uffffd"

Similarly, the following doesn't print the characters in the list above that belong to the third range (i.e. U+100000 and U+10FFFD).

stringi::stri_unescape_unicode("\\u100000")
stringi::stri_unescape_unicode("\\u10fffd")
[1] "က00"
[1] "ჿfd"

The question

  1. Why isn't stringi::stri_unescape_unicode() able to display characters that belong to the ranges U+F0000..U+FFFFD or U+100000..U+10FFFD?

  2. Is there any function in R that is able to return those characters?

rdrg109
  • This question seems to be related: https://stackoverflow.com/questions/41541138 – rdrg109 Nov 18 '22 at 23:50
  • `stringi::stri_unescape_unicode("\\ud83d\\udc31")` as well as `stringi::stri_unescape_unicode("\\U0001F431")` return `[1] "🐱"` (character `🐱`, U+1F431, *CAT FACE*, surrogates 0xd83d,0xdc31). For characters from private-use ranges you need a font which can render them… Note that `\\U0001F431` uses 8 hex digits, which is required above the BMP (_astral planes_): `stringi::stri_unescape_unicode("\\u1F431")` returns `[1] "ὃ1"` (two characters: `ὃ` (U+1F43, *Greek Small Letter Omicron With Dasia And Varia*) and `1` (U+0031, *Digit One*)). – JosefZ Nov 19 '22 at 17:33
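
The comment above suggests that `\u` reads exactly four hex digits while `\U` reads eight, which would explain the outputs shown in the question. The following is a minimal sketch of that idea applied to the private-use limits. It assumes stringi is installed (the stringi part is skipped otherwise), and the variable names are only illustrative. Base R's `intToUtf8()` and `utf8ToInt()` are used as an alternative that doesn't rely on escape sequences. Whether the resulting characters are displayed as glyphs still depends on having a font that defines them.

```r
if (requireNamespace("stringi", quietly = TRUE)) {
  # "\\u100000" is read as U+1000 followed by the literal characters "0" and "0".
  short_form <- stringi::stri_unescape_unicode("\\u100000")
  print(sprintf("U+%04X", utf8ToInt(short_form)))

  # The eight-digit form "\\U00100000" is read as the single code point U+100000.
  long_form <- stringi::stri_unescape_unicode("\\U00100000")
  print(sprintf("U+%04X", utf8ToInt(long_form)))
}

# Base R alternative: build each character directly from its integer code point.
pua_limits <- c(0xF0000, 0xFFFFD, 0x100000, 0x10FFFD)
pua_chars <- vapply(pua_limits, intToUtf8, character(1))

# Convert back to code points to check that each string holds one character
# with the expected value, independently of how the console renders it.
round_trip <- vapply(pua_chars, utf8ToInt, integer(1), USE.NAMES = FALSE)
code_point_labels <- sprintf("U+%X", round_trip)
print(code_point_labels)
```

Checking the code points with `utf8ToInt()` rather than looking at the printed string separates two questions: whether R produced the right character, and whether the console font can draw it.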

0 Answers