I am struggling to transliterate UTF-8 text into ASCII letters automatically.

In a data frame I have the following sequence, which originates from Greek letters:

<U+03A0><U+0391><U+039D><U+0391>G

By manually converting the sequence to

\u03A0\u0391\u039D\u0391G

I got the correct transcription with stringi, using these commands:

library(stringi)
t <- "\u03A0\u0391\u039D\u0391G" # original "ΠΑΝΑΓ"
t <- stri_trans_general(t,"any-latin")
t <- stri_trans_general(t,"latin-ascii")
print(t)
# [1] "PANAG"

Now, I want to automate the translation via stringr using:

t2 <- "<U+03A0><U+0391><U+039D><U+0391>G"
t2 <- str_replace_all(t2,">","")
t2 <- str_replace_all(t2,"<U+","\\u") # double \\ for the escape character

The result is:

[1] "+03A0+0391+039D+0391G"

This cannot be transliterated via stringi.

My question is how to convert the original string into ASCII letters via stringr and stringi, as my data frame contains a lot of these string sequences.

I am running RStudio Version 0.99.825 on R

R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale: [1] LC_COLLATE=German_Austria.1252

RStudio and R are running as portable apps.

Thank you in advance

Kind regards

Markus

1 Answer

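The reason that t2 <- str_replace_all(t2,"<U+","\u") doesn't work is that \u starts a Unicode escape, and R expects the hex code of the character right after the \u. Note also that an unescaped + is a quantifier in a regular expression, so the pattern "<U+" matches only "<U" and leaves the literal + behind; it has to be written as "<U\\+". To insert the two characters \u as the replacement you need "\\\\u" (you have to escape twice: once for R and once for gsub). However, you then end up with the six-character text \u03A0 (a backslash, a u and four hex digits), which is not the same as the single character you get from "\u03A0" typed in the console or sourced from a file. The trick I used below is to parse the string, so that R interprets the escapes.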
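To make that difference concrete, here is a small sketch comparing the escaped text with the real character. The variable names literal, real and parsed are only illustrative.

```r
literal <- "\\u03A0"  # six characters: backslash, u, 0, 3, A, 0
real    <- "\u03A0"   # one character: Greek capital letter pi

nchar(literal)  # 6
nchar(real)     # 1

# Wrapping the text in quotes and parsing it as R code makes R
# interpret the \u escape, just as it would in a source file.
parsed <- eval(parse(text = paste0('"', literal, '"')))
nchar(parsed)   # 1
```

Keep in mind that eval(parse()) runs whatever code the text contains, so it should only be applied to data you trust.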

There is probably a simpler way to do this, but the following works:

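library(stringi)

str <- "<U+03A0><U+0391><U+039D><U+0391>G"

# Turn each "<U+XXXX>" into the literal text \uXXXX
t <- gsub("<U\\+", "\\\\u", str)
t <- gsub(">", "", t)
# Parse the text as an R string literal so the \u escapes are interpreted
t <- eval(parse(text=paste0('"', t, '"')))

# Greek -> Latin script, then Latin -> plain ASCII
t <- stri_trans_general(t,"any-latin")
stri_trans_general(t,"latin-ascii")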
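For a whole data frame column, one possible simpler route is to let stringi interpret the escapes with stri_unescape_unicode() instead of eval(parse()). The sketch below is only an illustration: the helper names decode_u_plus and to_ascii and the column names are made up, and it assumes every placeholder has exactly four hex digits, as in the question.

```r
library(stringi)

# Hypothetical helper: turn "<U+XXXX>" placeholders into real characters.
# Assumes exactly four hex digits per placeholder.
decode_u_plus <- function(x) {
  # "<U+03A0>" becomes the literal text \u03A0 ...
  escaped <- gsub("<U\\+([0-9A-Fa-f]{4})>", "\\\\u\\1", x)
  # ... which stringi then converts into the actual character.
  stri_unescape_unicode(escaped)
}

# Hypothetical helper: decode, then transliterate to ASCII
# using the same two steps as in the question.
to_ascii <- function(x) {
  x <- decode_u_plus(x)
  x <- stri_trans_general(x, "any-latin")
  stri_trans_general(x, "latin-ascii")
}

# Both helpers are vectorised, so they work on a whole column.
df <- data.frame(name = c("<U+03A0><U+0391><U+039D><U+0391>G", "Wien"),
                 stringsAsFactors = FALSE)
df$name_ascii <- to_ascii(df$name)
df$name_ascii
```

One caveat: stri_unescape_unicode() also interprets any other backslash sequences already present in the data, so strings containing real backslashes may need extra care.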
Jan van der Laan