To change values in the results data frame, I use a stringr-based function recommended in an answer by Hadley Wickham (https://stackoverflow.com/a/12829731/2872891). I left the function intact, except for changing the final df to return(df), which I prefer. However, I see some strange behavior and I'm not sure what causes it. The subsequent calls of replace_all, in particular calls #3 and #4, do not recover the original data: http: and mailto:. A reproducible example follows.

Data (just one record of data):

Please see this Gist on GitHub: https://gist.github.com/abnova/1709b1e0cf8a57570bd1#file-gistfile1-r

Code (for brevity, I removed my comments with detailed explanations):

DATA_SEP <- ":"

rx <- "([[:alpha:]][^.:]|[[:blank:]])::([[:alpha:]][^:]|[[:blank:]])"
results <- gsub(rx, "\\1@@\\2", results)
results <- gsub(": ", "!@#", results) # should be after the ::-gsub
results <- gsub("http://", "http//", results)
results <- gsub("mailto:", "mailto@", results)

results <- gsub("-\\r\\n", "-", results) # order is important here
results <- gsub("\\r\\n", " ", results)

results <- gsub("\\n:gpl:962356288", ":gpl:962356288", results)

results <- readLines(textConnection(unlist(results)))
numLines <- length(results)
results <- lapply(results, function(x) gsub(".$", "", x))

data <- read.table(textConnection(unlist(results)),
                   header = FALSE, fill = TRUE,
                   sep = DATA_SEP, quote = "",
                   colClasses = "character", row.names = NULL,
                   nrows = numLines, comment.char = "",
                   strip.white = TRUE)

replace_all(data, fixed("!@#"), ": ")
replace_all(data, fixed("@@"), "::")
replace_all(data, fixed("http//"), "http://")
replace_all(data, fixed("mailto@"), "mailto:")

Results - actual:

> data$V3
[1] "http//www.accessgrid.org/"
> data$V17
[1] "http//mailto@accessgrid-tech@lists.sourceforge.net"

Results - expected:

> data$V3
[1] "http://www.accessgrid.org/"
> data$V17
[1] "http://mailto:accessgrid-tech@lists.sourceforge.net"

I'd appreciate any help and/or advice.

Aleksandr Blekh
  • `fixed` matches your pattern as a regular string, try removing that from your replace_all and see what happens. I don't have a compiler nearby. – hwnd May 12 '14 at 12:31
  • @hwnd: I used `fixed` intentionally as I wanted to search and replace a fixed string. But, just in case, I'll give it a try and report. Thanks! – Aleksandr Blekh May 12 '14 at 12:36
  • @hwnd: I just tried your suggestion. Unfortunately, it didn't help. The result is the same as with `fixed`. – Aleksandr Blekh May 12 '14 at 12:41

2 Answers

I tested this and found an issue with the replacement using multiple calls to replace_all back to back.

replace_all(data, fixed("!@#"), ": ")
replace_all(data, fixed("@@"), "::")
replace_all(data, fixed("http//"), "http://")
replace_all(data, fixed("mailto@"), "mailto:")

You are not seeing the expected output because you are not assigning the result of each replace_all call to anything. It should be:

data <- replace_all(data, fixed("!@#"), ": ")
data <- replace_all(data, fixed("@@"), "::")
data <- replace_all(data, fixed("http//"), "http://")
data <- replace_all(data, fixed("mailto@"), "mailto:")
data
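
The reason the assignment matters is that R functions work on a copy of their arguments and return a new value; they do not modify the caller's data frame in place. Here is a minimal sketch of that behavior. The helper `replace_all_df` is a hypothetical stand-in for the stringr-based function from the question, written with base R's `gsub` so that it runs without extra packages, and the one-column data frame is made up for illustration.

```r
# Hypothetical stand-in for the question's stringr-based helper (an assumption,
# not the original function): replaces a literal string in every character column.
replace_all_df <- function(df, pattern, replacement) {
  is_char <- vapply(df, is.character, logical(1))
  df[is_char] <- lapply(df[is_char], function(col) {
    gsub(pattern, replacement, col, fixed = TRUE)
  })
  df
}

# Made-up sample record resembling column V3 from the question
demo <- data.frame(V3 = "http//www.accessgrid.org/", stringsAsFactors = FALSE)

# The result is printed and then discarded; 'demo' itself is untouched
replace_all_df(demo, "http//", "http://")
unchanged <- demo$V3

# Assigning the result back is what actually updates 'demo'
demo <- replace_all_df(demo, "http//", "http://")
fixed_value <- demo$V3
```

After the first call `unchanged` still holds the escaped value; only after the assignment does `fixed_value` hold the restored URL.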

Another way to do this without using stringr would be to create vectors that contain your pattern and replacements and loop through them with one call for the replacement.

re  <- c('!@#', '@@', 'http//', 'mailto@')
val <- c(': ',  '::', 'http://', 'mailto:')

replace_all <- function(pattern, repl, x) {
    # apply each pattern/replacement pair in order, as literal strings
    for (i in seq_along(pattern)) {
        x <- gsub(pattern[i], repl[i], x, fixed = TRUE)
    }
    x
}
replace_all(re, val, data)

Output

[3] "http://www.accessgrid.org/"
[17] "http://mailto:accessgrid-tech@lists.sourceforge.net"   
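
Note that calling `gsub` directly on a data frame returns a plain character vector (hence the `[3]` and `[17]` indices above) rather than a data frame. If the data frame structure needs to be kept, the same loop can be applied column by column. The following is a rough sketch under that assumption; `replace_pairs_df` and the two-column sample record are hypothetical names and data chosen for illustration.

```r
patterns     <- c("!@#", "@@", "http//", "mailto@")
replacements <- c(": ",  "::", "http://", "mailto:")

# Hypothetical helper: applies every pattern/replacement pair, in order,
# to each character column and returns the modified data frame.
replace_pairs_df <- function(df, patterns, replacements) {
  stopifnot(length(patterns) == length(replacements))
  is_char <- vapply(df, is.character, logical(1))
  for (i in seq_along(patterns)) {
    df[is_char] <- lapply(df[is_char], function(col) {
      gsub(patterns[i], replacements[i], col, fixed = TRUE)
    })
  }
  df
}

# Made-up sample record using the two values shown in the question
sample_df <- data.frame(
  V3  = "http//www.accessgrid.org/",
  V17 = "http//mailto@accessgrid-tech@lists.sourceforge.net",
  stringsAsFactors = FALSE
)

# The result must be assigned, exactly as with the stringr-based version
sample_df <- replace_pairs_df(sample_df, patterns, replacements)
```

Non-character columns are left untouched, and the order of the pairs is preserved, which matters when one replacement could create text matched by a later pattern.
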
hwnd
  • I appreciate your help! This looks very good and might be useful. While I use `stringr` package in other parts of my code, I might be able to not use it in this place (though I doubt that I found a bug). Nevertheless, I still need to use `replace_all` function, modified to use `gsub` and such, as I need to perform search and replace throughout all eligible fields of the results data frame. – Aleksandr Blekh May 12 '14 at 13:11
  • I was offline until now, just saw your update. Thank you so much for your help! My testing confirms your description (fails when calling back to back). I will switch this code to using `gsub`, but I hope that @hadley will be able to comment on this behavior of `stringr` package and, if this is a defect, acknowledge and fix it. – Aleksandr Blekh May 12 '14 at 23:00
  • @AleksandrBlekh I retested the code and figured that out last night, I updated my answer. – hwnd May 13 '14 at 13:47
  • Thank you! It's interesting that we've come to the same conclusion independently pretty much simultaneously, give or take a few hours. That's rather surprising, considering my significant break in active software development. But, I'm back now. – Aleksandr Blekh May 13 '14 at 20:32

After almost finishing the alternative (gsub-based) implementation suggested by @hwnd, I realized what the problem with my original code was. I quickly tested the fixed code and it confirmed my thoughts. For each subsequent replace_all call, I simply needed to save the result returned by the previous call. Therefore, the fixed code looks like this:

# Now we can safely do post-processing, recovering original data
data <- replace_all(data, fixed("!@#"), ": ")
data <- replace_all(data, fixed("@@"), "::")
data <- replace_all(data, fixed("http//"), "http://")
data <- replace_all(data, fixed("mailto@"), "mailto:")

Again, thanks to @hwnd for valuable suggestions, which helped me to figure out this issue.

Aleksandr Blekh