3

I have a text file that contains fallback representations of Unicode characters, with the code points written in angle brackets. For example, it contains foo<U+0161>bar, which should be "foošbar". Is there an easy way in R to convert the whole file to UTF-8 with these characters converted? Unfortunately, I am on Windows and can't find a supported UTF-8 locale.

user43018
  • UTF8 is an encoding, *NOT* a locale. Anyway, Windows uses Unicode natively since 2000 at least. R packages though mix up Unicode and ANSI code, then depend on changing localization settings to handle what is an encoding issue. What did you actually try? Different packages have different quirks. Some of them unfortunately confuse language and encoding – Panagiotis Kanavos Oct 04 '16 at 10:00
  • What are the *file's* encoding and contents? Does it use one of the Unicode encodings? Then it could contain `foošbar` without any conversion issues. Are you sure the problem isn't RStudio's or RRO's display font? – Panagiotis Kanavos Oct 04 '16 at 10:09
  • my problem is that I can't switch to a UTF-8-friendly locale on Windows; things like `Sys.setlocale("LC_ALL", 'en_US.UTF-8')` don't work, don't know why. So I have this problem whatever encoding the file is. – user43018 Oct 04 '16 at 10:46
  • *Locales* have to do with countries, not Unicode encodings. The `Sys.setlocale` is actually an R workaround to allow ANSI-compiled packages to work with Unicode data - as long as they don't try to inspect the values. I have no problem entering or loading `foošbar` from a file for example. Some packages though fail to work with the loaded text while others have no problems. Some even mix Unicode and ANSI code – Panagiotis Kanavos Oct 04 '16 at 10:54
  • What *is* the code that shows the problem? Replacing strings is just a workaround. – Panagiotis Kanavos Oct 04 '16 at 10:57

2 Answers

5

Perhaps:

library(stringi)
library(magrittr)

"foo<U+0161>bar and cra<U+017E>y" %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\u$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()
## [1] "foošbar and cražy"

may work (I don't need the last conversion on macOS but you may on Windows).
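Since the question asks about a whole file, here is a minimal base R sketch of the same idea applied line by line. It swaps the stringi calls for `regmatches()` and `intToUtf8()`, so it is an alternative technique rather than the code above. The helper name `decode_markers` and the temporary file paths are made up for illustration, and the sketch assumes that everything outside the markers is ASCII or already UTF-8.

```r
# Hypothetical helper: replace every <U+XXXX> marker (1 to 6 hex digits)
# in a character vector with the character it stands for.
decode_markers <- function(x) {
  m <- gregexpr("<U\\+[0-9A-Fa-f]{1,6}>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    hex <- gsub("^<U\\+|>$", "", tokens)          # keep only the hex digits
    intToUtf8(strtoi(hex, 16L), multiple = TRUE)  # one character per marker
  })
  x
}

# Illustrative input file (an assumption; use your own path instead).
in_path <- tempfile(fileext = ".txt")
out_path <- tempfile(fileext = ".txt")
writeLines("foo<U+0161>bar and cra<U+017E>y", in_path)

lines <- readLines(in_path, encoding = "UTF-8", warn = FALSE)
converted <- decode_markers(lines)

# Write the UTF-8 bytes unchanged, whatever the current locale is.
con <- file(out_path, open = "wb")
writeLines(enc2utf8(converted), con, useBytes = TRUE)
close(con)
```

Writing through a binary connection with `useBytes = TRUE` is meant to keep R from re-encoding the text into the native code page, which is the part most likely to matter on Windows.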

hrbrmstr
  • No need for conversions, Windows uses Unicode natively. R packages on the other hand mix up Unicode and ANSI code a lot. To make matters worse, many R packages don't recognize encodings but try to guess from the system's locale or language. Which makes matters interesting when trying to read multiple Unicode encodings, or even multiple date and number formats – Panagiotis Kanavos Oct 04 '16 at 10:02
2

The previous answer should work when the code point is presented with exactly four digits. Here is a modified version that should work for any number of digits between 1 and 8.

library(stringi)
library(magrittr)

"foo<U+0161>bar and cra<U+017E>y, Phoenician letter alf <U+10900>" %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{4})>", "\\\\u$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{5})>", "\\\\U000$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{6})>", "\\\\U00$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{7})>", "\\\\U0$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{8})>", "\\\\U$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{1})>", "\\\\u000$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{2})>", "\\\\u00$1") %>% 
  stri_replace_all_regex("<U\\+([[:alnum:]]{3})>", "\\\\u0$1") %>% 
  stri_unescape_unicode() %>% 
  stri_enc_toutf8()
## [1] "foošbar and cražy, Phoenician letter alf 𐤀"
mvkorpel
  • But shouldn't the previous answer also work with any number of digits? I mean [[:alnum:]] means any alpha-numeric character, and the + means one or more times. (Of course maybe it was edited after you answered...) – Benjamin Aug 29 '19 at 16:36
  • The lowercase escape code `\u` accepts up to four digits. For example, the solution presented in the other answer would fail to process `"<U+102A0>"` correctly, converting it to `\u102A0`, i.e., the character `ဪ` (U+102A) followed by a literal zero. Also, if a code point was exceptionally not zero-padded to four digits, problems would ensue: `stri_unescape_unicode()` requires that `\u` is followed by four digits (and `\U` by eight digits). – mvkorpel Aug 30 '19 at 18:12
  • Damn, I see what you're saying! The regex itself picks everything up, but what it replaces the pattern with is not consumable in this context. Even though R can take `\U` with "up to 8" characters (if I print `"\U102A0"`, R spits out `"\U000102a0"`), `stri_unescape_unicode` is very strict: it takes either `\u1234` or `\U12345678`. `\U102A0` throws an error. I'm glad I asked; thank you! – Benjamin Aug 30 '19 at 22:29
  • That is to say, I also can't just replace the original solution with `stri_replace_all_regex("<U\\+([[:alnum:]]+)>", "\\\\U$1")`. – Benjamin Aug 30 '19 at 22:32
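
The four-digit limit of `\u` discussed in these comments can be checked directly in base R, without stringi. The sketch below also shows one possible way to zero-pad a code point to the eight-digit `\U` form; `pad_escape` is a made-up helper name, not part of any package.

```r
# "\u" in an R string literal consumes at most four hex digits,
# so the trailing 0 is left over as an ordinary character.
short_escape <- "\u102A0"
nchar(short_escape)       # 2 characters: U+102A plus "0"
utf8ToInt(short_escape)   # 4138 48

# "\U" accepts up to eight hex digits, so this is a single character.
long_escape <- "\U000102A0"
utf8ToInt(long_escape)    # 66208, i.e. 0x102A0

# Hypothetical helper: zero-pad any hex code point to the \UXXXXXXXX form.
pad_escape <- function(hex) sprintf("\\U%08X", strtoi(hex, 16L))
pad_escape(c("161", "017E", "102A0"))
## [1] "\\U00000161" "\\U0000017E" "\\U000102A0"
```

Going by the comments above, strings padded this way should be in the form that `stri_unescape_unicode()` expects, which would avoid one regex call per digit count.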