For the life of me, I am unable to strip out some escape characters from a text string (prior to further processing). I've tried stringi and gsub, but I just cannot get the syntax right.

Here is my text string

txt <- "c(\"\\r\\n    Stuff from a webpage: That I scraped using webcrawler\\r\\n\", \"\\r\\n        \", \"\\r\\n        \", \"\\r\\n        \", \"\\r\\n\\r\\n        \", \"\\r\\n\\r\\n        \", \"\\r\\n        \\r\\n    \", \"\\r\\n    \")"

I'd like to strip out "\\r\\n" from this string.

I've tried

gsub("[\\\r\\\n]", "", txt)  # leaves me with "rn"
gsub("[\\r\\n]", "", txt)    # leaves me without ANY r or n in the text
gsub("[\r\n]", "", txt)      # strips nothing

How can I remove these characters? Bear in mind that this will need to work over other entries that may have normal words ending in "rn" or have "rn" in the middle somewhere!

Thanks!

Jon
  • Try `gsub('[\\\\r\\\\n]', '', txt)` – akrun Jul 17 '18 at 15:11
  • Instead of replacing newlines after the fact modify the code that *reads* the text to recognize `\r\n` as a separator. BTW newlines in a web page are meaningless, they are treated as whitespace. Only *tags* affect how the text is displayed – Panagiotis Kanavos Jul 17 '18 at 15:11
  • The gsub suggested here also strips plain r and n characters. re: Panagiotis, I'm using Rcrawler, and I'm not familiar with how to exclude these characters (if even possible). – Jon Jul 17 '18 at 15:15
  • Try `gsub("\\\\n", "", gsub("\\\\r", "", txt))` if you want to remove all of either \\r or \\n, or try `gsub("\\\\r\\\\n", "", txt)` if you want to remove only the places where they occur together as \\r\\n. – Kerry Jackson Jul 17 '18 at 15:31
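
The difference between the two suggestions in the last comment can be sketched on small made-up strings. This is only an illustration: `x`, `y`, `pairs_only` and `either_one` are hypothetical names that do not appear in the thread, and the sample text is invented to include words containing "rn".

```r
# Hypothetical sample strings (not from the question): they hold the literal
# characters backslash-r-backslash-n, not real carriage returns or line feeds.
x <- "\\r\\n  born in a barn\\r\\n"
y <- "tab\\n end\\r\\n"

# In the R literal "\\\\r\\\\n", each "\\\\" becomes the regex \\,
# which matches exactly one literal backslash.
pairs_only <- function(s) gsub("\\\\r\\\\n", "", s)

# Two passes: drop every literal \r, then every literal \n.
either_one <- function(s) gsub("\\\\n", "", gsub("\\\\r", "", s))

cat(pairs_only(x), "\n")   # words containing "rn" are left intact
cat(pairs_only(y), "\n")   # the lone \n after "tab" is kept
cat(either_one(y), "\n")   # the lone \n is removed as well
```

Because neither pattern uses a character class, a plain "r" or "n" is only touched when a backslash sits directly in front of it.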

2 Answers

1

At the risk of answering my own question too quickly, I've found a bodge workaround: swap each "\" for a rare placeholder, "__", then remove the resulting "__r__n":

gsub('__r__n', '', gsub('[\\\\]', '__', txt))

... but it would be valuable I think to share a better "one hit" solution.
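
One possible "one hit" version, offered as a sketch rather than a definitive answer, is to switch off regular-expression matching with the `fixed = TRUE` argument of `gsub`. The pattern is then treated as a literal string, so only R's own string escaping applies and the placeholder step is not needed. The sample string and the names `x`, `one_hit` and `tidy` are made up for illustration.

```r
# Hypothetical sample (not from the question): literal backslash-r-backslash-n
# wrapped around ordinary words that contain "rn".
x <- "\\r\\n    modern corner\\r\\n"

# With fixed = TRUE the pattern is matched literally; "\\r\\n" is the
# four characters backslash, r, backslash, n.
one_hit <- gsub("\\r\\n", "", x, fixed = TRUE)

# Optional tidy-up of the leftover indentation.
tidy <- trimws(one_hit)

print(tidy)
```

Unlike the placeholder approach, this leaves any other backslashes in the text untouched, since nothing is rewritten to "__" along the way.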

Jon
  • It's hacky, but this worked for me. I couldn't find another solution that would work with my string: `x <- "\\nDécor is fun"; gsub('__n', '', gsub('[\\\\]', '__', x))` But I'd be interested to know if there's a more elegant solution. – torenunez Apr 02 '20 at 23:55
1

Not very pretty, but this works:

library(stringr)
str_remove_all(txt, "(?<=\\\\n)\\s+|\\s+(?=\\\")|\\\"|(?<=\\\"),|\\\\r(?=\\\\n)|(?<=\\\\r)\\\\n")
[1] "c(Stuff from a webpage: That I scraped using webcrawler)"

I'm sure there are more efficient regex solutions, but I just fed it every possibility of things you don't want.

I also got rid of all the extra quotation marks, commas, and whitespace.

If you just want to match the result that you posted above:

str_remove_all(txt, "\\\\r(?=\\\\n)|(?<=\\\\r)\\\\n")

This reads: remove any \\r that is followed by \\n, or any \\n that is preceded by \\r.
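
For readers without stringr, the same lookaround pattern can be tried in base R, assuming `perl = TRUE` is passed so that lookahead and lookbehind are available. The sketch below uses a made-up sample string and hypothetical variable names, and compares the result with simply matching the literal pair in one go.

```r
# Hypothetical sample (not from the question): two literal \r\n pairs in a
# row, a word containing "rn", another pair, and one lone literal \n.
x <- "\\r\\n\\r\\n  lantern\\r\\n \\n"

# Lookaround form: a literal \r followed by \n, or a literal \n preceded by \r.
lookaround <- gsub("\\\\r(?=\\\\n)|(?<=\\\\r)\\\\n", "", x, perl = TRUE)

# Plain form: match the literal four-character sequence directly.
plain <- gsub("\\\\r\\\\n", "", x)

print(identical(lookaround, plain))
print(lookaround)
```

On this sample both forms give the same string and leave the lone \\n in place, which suggests the shorter pattern may be enough when only the paired sequence needs removing.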

AndS.