
I have a dataframe I am trying to convert to RDF to edit in Protege. The dataframe unfortunately contains ASCII escape codes that are not visible when the strings are printed, most notoriously \u0020, which is the code for a space.

x <- "\u0020"
x
[1] " "

grepl() matches the pattern fine, but grep() with value = TRUE returns the resolved space rather than the original escape code when the result is printed.

match <- grep(pattern = "\u0020", x = x, value = TRUE)
match
[1] " "

The problem is that these codes are throwing Protege off, and I'm trying to normalize them to plain characters, e.g. \u0020 to " ", but I cannot find any regex that will catch them and replace them with the single non-code character. The pattern [^ -~] does not match these values, and I'm otherwise completely blind to these strings. How can I normalize any of these codes in R?
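For reference, a quick base-R check (a sketch, not part of the original question) shows why that pattern finds nothing: a space sits inside the printable-ASCII range from " " to "~", so the negated class cannot match it.

```r
grepl("[^ -~]", " ")          # FALSE: a space *is* a printable ASCII character
grepl("[^ -~]", "caf\u00e9")  # TRUE: é falls outside the printable-ASCII range
```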

MeeraWhy
    The string "\u0020" and the string " " are *identical* in R. The parser converts that Unicode escape to a space. There is no way to tell the difference looking at your variable `x`. The only way to discover such things is to look at the source code. – user2554330 Jul 21 '21 at 00:34
  • Perhaps worth a read [utf8 vignette](https://cran.r-project.org/web/packages/utf8/vignettes/utf8.html). – Chris Jul 21 '21 at 04:29
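As user2554330's comment says, the two literals are the same object once parsed; a minimal base-R check (sketch) confirms it and shows the underlying code point:

```r
x <- "\u0020"
identical(x, " ")  # TRUE: the parser resolved the escape already
utf8ToInt(x)       # 32, the code point of a plain space
```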

1 Answer


Personally, I would just unescape the Unicode sequences in the file using the stringi library.

Given a CSV file, test.csv that looks like

col1,col2,col3
\u0020, moretext, evenmoretext

First load it as a data.frame

> frame <- read.csv("test.csv", encoding = "UTF-8")
> frame
     col1      col2          col3
1 \\u0020  moretext  evenmoretext

Next, find all of the occurrences that you want to replace and use stri_unescape_unicode to turn it into something that Protege likes.

> library(stringi)
> frame$col1
[1] "\\u0020"
> frame$col1 <- stri_unescape_unicode(frame$col1)
> frame$col1
[1] " "

Once replaced, you should be able to write your csv back to disk without the unicode entries.
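A sketch of the whole round trip, applying the unescape across every character column rather than just col1 (the output filename "test_clean.csv" is only an example):

```r
library(stringi)

frame <- read.csv("test.csv", encoding = "UTF-8")

# Unescape every character column in one pass
chr_cols <- vapply(frame, is.character, logical(1))
frame[chr_cols] <- lapply(frame[chr_cols], stri_unescape_unicode)

write.csv(frame, "test_clean.csv", row.names = FALSE)
```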

Thomas