I hope my solution helps. First, as you probably know, U+041A is a Unicode code point written in hexadecimal. I want to emphasize that, because I think it is a bad idea to convert these codes to Cyrillic letters up front. I think it is best to work with your text through the hexadecimal code points. In other words, keep the code points of the letters in mind, not the letters per se, when working with the text.
This way, it is going to be easier to apply regex and other transformations to your text. When you want to read your text as Cyrillic, you just need to ask R to interpret your vector of code points as UTF-8 text, through a function like intToUtf8().
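To make that concrete, here is a small sketch of what intToUtf8() does with a couple of code points. The variable names are only for illustration and are not part of your data.

```r
# Code points of the first two letters of the example text, as hex literals
code_points <- c(0x041A, 0x0440)

# By default all code points are combined into a single string
word_start <- intToUtf8(code_points)

# multiple = TRUE returns one string per code point instead
letters_vec <- intToUtf8(code_points, multiple = TRUE)

# utf8ToInt() goes the other way, from text back to code points
round_trip <- utf8ToInt(word_start)
```

So you can keep working with the numbers and only call intToUtf8() at the very end, when you want to look at the text.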
The first thing to do is to deal with the spaces between the Cyrillic words. Detect each white space in your text and then replace it with its own Unicode token, <U+0020> (yes, even a white space has a code point). After that, split the text so that each letter, or character that forms your phrase, becomes one element of a vector.
Next, I need to remove the leftover characters (> and +) and leave only the hexadecimal code in each element of vector a. After that, I replace each letter U with 0x, so that only the hexadecimal part of the code point remains, in a form R can read as a number. This is easier because, to read the code U041A as a Unicode escape, I would need to insert a single backslash before the U (resulting in \U041A), and I was struggling to do that. After these steps, each element of vector a is one character (or letter) of your phrase.
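On the backslash problem: as far as I understand it, an escape like \u041A is resolved when R parses a string literal, so pasting a backslash in front of the digits at run time does not give the same result. A small sketch of the difference (the variable names are made up for illustration):

```r
# Written as a literal, the escape is resolved by the parser into one character
parsed <- "\u041A"
nchar(parsed)   # 1

# Built at run time, the result is six separate characters: \ u 0 4 1 A
pasted <- paste0("\\u", "041A")
nchar(pasted)   # 6

# Converting the hex digits to an integer avoids the escape altogether
from_hex <- intToUtf8(strtoi("041A", base = 16L))
identical(from_hex, parsed)
```

That is why going through the integer value of the code point is the more convenient route here.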
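library(stringr)
text <- "<U+041A><U+0440><U+0430><U+0433><U+0443><U+0458><U+0435><U+0432><U+0430><U+0446> <U+0410><U+0421>"
# Replace white spaces with their own token so they survive the split
a <- str_replace_all(text, " ", replacement = "<U+0020>")
# Split at every "<"; the first element is an empty string, so drop it
a <- unlist(str_split(a, "[<]"))
a <- a[-1]
# Strip ">" and "+", leaving e.g. "U041A"
a <- str_replace_all(a, ">", "")
a <- str_replace_all(a, "\\+", "")
# "U041A" becomes "0x041A", which R can coerce to an integer
a <- str_replace_all(a, "U", "0x")
# Combine all code points into a single UTF-8 string
intToUtf8(a)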
[1] "Крагујевац АС"
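If it helps, the same steps can be folded into one base R function. This is only a sketch, and decode_unicode_tokens is a name I made up. It assumes every escaped character in your text looks like <U+XXXX>, and it leaves everything else (such as real spaces) untouched.

```r
# Hypothetical helper: replace every <U+XXXX> token with the character it encodes
decode_unicode_tokens <- function(x) {
  # Locate all tokens of the form <U+hex> in each element of x
  m <- gregexpr("<U\\+[0-9A-Fa-f]+>", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(tokens) {
    # Keep only the hex digits, e.g. "<U+041A>" becomes "041A"
    hex <- gsub("^<U\\+|>$", "", tokens)
    # Hex string -> integer code point -> one UTF-8 character per token
    vapply(strtoi(hex, base = 16L), intToUtf8, character(1))
  })
  x
}

sample_text <- "<U+041A><U+0440> <U+0410>"
decoded <- decode_unicode_tokens(sample_text)
```

Because the function works on whole character vectors, you should be able to apply it directly to a column of a data frame, for example with decode_unicode_tokens(text).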