find_replace nul character in R

Question

The closest thing to my problem that I could find is are-there-raw-strings-in-r. However, it does not help me enough.

The problem

I have Windows-style paths in a data frame:

data.frame(path = c("X:\01_aim\01_seq.R", "X:\01_aim\02_seq.R", "X:\01_aim\03_seq.R"), 
           dat = c("data1.csv", "data2.csv", "data1.csv"))

The aim is to convert the paths into Unix-like paths, so I need output like:

data.frame(path = c("/01_aim/01_seq.R", "/01_aim/02_seq.R", "/01_aim/03_seq.R"), 
           dat = c("data1.csv", "data2.csv", "data1.csv"))

My approach

My attempt to manipulate the paths shown above generates the following error:

> sub("\0", "##", "X:\01_aim\01_seq.R")
# Error: nul character not allowed (line 1)

What I have found so far is a way to write the path using the r"()" raw string syntax, which gives:

> r"(X:\01_aim\01_seq.R)"
[1] "X:\\01_aim\\01_seq.R"

With that my final solution would be close to:

library(stringr)
tmp_path <- str_replace_all(string = r"(X:\01_aim\01_seq.R)",
    pattern = r"(\\)",
    replacement = "/")
str_replace_all(tmp_path, r"(X:)", "")
[1] "/01_aim/01_seq.R"

but what I lack is a way to force the r"( )" formatting on a string that is already stored in a variable. Specifically, when I have a function:

convert.path <- function(my.path){
   # how can I force the variable my.path to be stored as r"(`my.path`)"
   # so that I can insert the above code here.
   my.path.raw <- to.r.brackets(my.path)
   tmp_path <- str_replace_all(my.path.raw, pattern = r"(\\)", replacement =  "/")
   str_replace_all(tmp_path, r"(X:)", "")
}

I would like to replace the comments with code that forces this re-formatting. Does anyone have an idea how to do this?

Your `my.path` should already contain the correct string (= *text*). Else, there is no other way. Unless there is some scenario you have not explained. — Wiktor Stribiżew, Aug 23 '21 at 11:45
Perhaps you could split your path and use R's `file.path` function? — Martin Gal, Aug 23 '21 at 11:52
Your premise is wrong. There is no difference between how `r"( )"` strings are stored versus other strings. The `r"( )"` format is simply a way to specify a string in code. It uses different input rules than the usual `" "` strings, but what it produces and stores is indistinguishable from other strings. — user2554330, Aug 23 '21 at 12:06
@user2554330 Ok, so basically there is no work around for this code to work: `my.path <- "X:\01_aim\01_seq.R", sub("\0", "##", my.path)` ? — storaged, Aug 23 '21 at 12:13
That's not legal code. In regular string code, `"\0"` means the null character, not a backslash followed by a zero, and nulls aren't allowed in R strings. To code your path you should use `"X:\\01_aim\\01_seq.R"`. In `sub()`, things are even worse, because you need a double backslash to match a backslash, and you need `"\\\\"` to code for two backslashes. So the `sub()` should be `sub("\\\\0", "##", my.path)`. — user2554330, Aug 23 '21 at 14:27
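To illustrate the comments above, here is a minimal sketch (assuming R >= 4.0.0 for raw strings). It shows that an escaped string and a raw string store the same text, and that the conversion is simple once the backslashes are really in the string. The helper name `to_unix_path` is made up for this example and is not part of any package.

```r
# Two ways to write the same path in source code.
escaped <- "X:\\01_aim\\01_seq.R"
raw     <- r"(X:\01_aim\01_seq.R)"
same    <- identical(escaped, raw)  # the stored strings do not differ

# Hypothetical helper: drop the drive letter, then swap the separators.
to_unix_path <- function(path) {
  no_drive <- sub("^[A-Za-z]:", "", path)
  # fixed = TRUE treats "\\" as one literal backslash, not as a regex.
  gsub("\\", "/", no_drive, fixed = TRUE)
}

paths <- c(r"(X:\01_aim\01_seq.R)", r"(X:\01_aim\02_seq.R)")
unix_paths <- to_unix_path(paths)
print(unix_paths)
#> [1] "/01_aim/01_seq.R" "/01_aim/02_seq.R"
```

Using `fixed = TRUE` avoids the four-backslash regex pattern mentioned above. It only works when the string really contains backslashes, which is not the case for a literal typed as `"X:\01_aim\01_seq.R"`.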

TimTeaFan · Accepted Answer · 2021-08-23T20:55:25.403

One way is to use gsub() within eval(parse(text = ...)):

dat <- data.frame(path = c("X:\01_aim\01_seq.R", "X:\01_aim\02_seq.R", "X:\01_aim\03_seq.R", "X:\01_aim\04_seq.R"), 
                  dat = c("data1.csv", "data2.csv", "data1.csv", "data2.csv"))

temp <- eval(parse(text= gsub("\\", "/", deparse(dat$path), fixed=TRUE)))
gsub("X:", "", temp)

#> [1] "/001_aim/001_seq.R" "/001_aim/002_seq.R" "/001_aim/003_seq.R"
#> [4] "/001_aim/004_seq.R"

Created on 2021-08-23 by the reprex package (v2.0.1)

Another way is to escape the strings containing one backslash using stringi::stri_escape_unicode. Since the string is converted to Unicode before being escaped, this adds an unwanted u0 after each pair of backslashes. We can then use gsub("\\\\u0", "/") to get the desired file path.

dat <- data.frame(path = c("X:\01_aim\01_seq.R", "X:\01_aim\02_seq.R", "X:\01_aim\03_seq.R"), 
                  dat = c("data1.csv", "data2.csv", "data1.csv"))

temp <- gsub("X:", "", stringi::stri_escape_unicode(dat$path))
gsub("\\\\u0", "/", temp)
#> [1] "/001_aim/001_seq.R" "/001_aim/002_seq.R" "/001_aim/003_seq.R"

Created on 2021-08-23 by the reprex package (v2.0.1)
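The three-digit prefixes in both outputs above (001 rather than 01) can be traced to how R parses the literal. As far as I understand it, `"\01"` is read as an octal escape for the control character with code 1, so the string never contains a backslash at all. A small sketch in base R to check this:

```r
x <- "X:\01_aim\01_seq.R"  # "\01" is parsed as an octal escape

n_chars <- nchar(x)                      # 14, not 18: each "\01" is one character
code    <- utf8ToInt(substr(x, 3, 3))    # 1, i.e. the control character U+0001
has_backslash <- grepl("\\", x, fixed = TRUE)  # FALSE: no backslash is stored

# deparse() writes the control character back as the escape \001,
# which is where the extra zeros in the converted paths come from.
deparsed <- deparse(x)
cat(deparsed, "\n")
```

Both approaches therefore recover the digits from the character code, not from the original text, so they depend on which escape the literal happened to produce.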

Dear @TimTeaFan do you see any way to improve for the case of: `"X:\000_aim\00_seq.R"`? There is still an error `Error: nul character not allowed (line 1)` because of `\000` I guess... EDIT: And also `"X:\A00_aim\00_seq.R"` fails. — storaged, Aug 23 '21 at 16:43
@storaged: I have added another approach, maybe this will work with `"X:\A00_aim\00_seq.R"`. However, since this R code is non-legal I'm not even able to get it into a `data.frame`. I wonder where these strings come from? If you read in a csv R will automatically escape "X:\A00_aim\00_seq.R" as `"X:\\A00_aim\\00_seq.R"`. Where do you get those strings from? — TimTeaFan, Aug 23 '21 at 20:57
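To check the point about reading from a file, here is a sketch that writes a small CSV containing single backslashes and reads it back. The file contents and the column names are made up for the example. With the defaults of `read.csv`, the backslashes arrive as ordinary characters, so cases such as `\000` or `\A00` should cause no trouble.

```r
# Write a small CSV whose cells contain single backslashes, as a file on disk would.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("path,dat",
             r"(X:\A00_aim\00_seq.R,data1.csv)",
             r"(X:\000_aim\00_seq.R,data2.csv)"),
           csv_file)

dat <- read.csv(csv_file, stringsAsFactors = FALSE)

# Drop the drive prefix, then replace each literal backslash with a slash.
dat$path <- gsub("\\", "/", sub("^X:", "", dat$path), fixed = TRUE)
print(dat$path)
#> [1] "/A00_aim/00_seq.R" "/000_aim/00_seq.R"

unlink(csv_file)
```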
