To change values in the results data frame, I use a stringr-based function recommended in an answer by Hadley Wickham (https://stackoverflow.com/a/12829731/2872891). I left the function intact, except that I changed the final df to return (df), which I like better. However, I see some strange behavior and I'm not sure what causes it: the subsequent calls of replace_all, in particular calls #3 and #4, do not restore the original http: and mailto: strings. A reproducible example follows.
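For context, this is roughly what the helper looks like. It is a sketch reconstructed from the linked answer, so the body may differ in small details from the original; the only intended change is the explicit return (df) at the end. The toy data frame is made up for illustration and is not part of my data.

```r
library(stringr)

# Apply str_replace_all() to every character or factor column of df.
# Sketch of the helper from the linked answer; details may differ.
replace_all <- function(df, pattern, replacement) {
  # Logical vector flagging the columns that hold text
  char <- vapply(df, function(x) is.factor(x) || is.character(x), logical(1))
  # Run the replacement on those columns only
  df[char] <- lapply(df[char], str_replace_all, pattern, replacement)
  return (df)
}

# Hypothetical one-column data frame, not taken from the Gist
toy <- data.frame(V1 = "http//example.org/", stringsAsFactors = FALSE)
replace_all(toy, fixed("http//"), "http://")
```

The call at the end has the same shape as the four replace_all calls in my code below.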
Data (just one record of data):
Please see this Gist on GitHub: https://gist.github.com/abnova/1709b1e0cf8a57570bd1#file-gistfile1-r
Code (for brevity, I removed my comments with detailed explanations):
DATA_SEP <- ":"
rx <- "([[:alpha:]][^.:]|[[:blank:]])::([[:alpha:]][^:]|[[:blank:]])"
results <- gsub(rx, "\\1@@\\2", results)
results <- gsub(": ", "!@#", results) # should be after the ::-gsub
results <- gsub("http://", "http//", results)
results <- gsub("mailto:", "mailto@", results)
results <- gsub("-\\r\\n", "-", results) # order is important here
results <- gsub("\\r\\n", " ", results)
results <- gsub("\\n:gpl:962356288", ":gpl:962356288", results)
results <- readLines(textConnection(unlist(results)))
numLines <- length(results)
results <- lapply(results, function(x) gsub(".$", "", x))
data <- read.table(textConnection(unlist(results)),
                   header = FALSE, fill = TRUE,
                   sep = DATA_SEP, quote = "",
                   colClasses = "character", row.names = NULL,
                   nrows = numLines, comment.char = "",
                   strip.white = TRUE)
replace_all(data, fixed("!@#"), ": ")
replace_all(data, fixed("@@"), "::")
replace_all(data, fixed("http//"), "http://")
replace_all(data, fixed("mailto@"), "mailto:")
Results - actual:
> data$V3
[1] "http//www.accessgrid.org/"
> data$V17
[1] "http//mailto@accessgrid-tech@lists.sourceforge.net"
Results - expected:
> data$V3
[1] "http://www.accessgrid.org/"
> data$V17
[1] "http://mailto:accessgrid-tech@lists.sourceforge.net"
I'd appreciate any help and/or advice.