inverse of gsub

Question

I have some HTML code I'm working with, and I want to extract certain strings from it.

I want to extract this from string x, preferably using base R: coleman_l, SMOG4

Here is what I have:

x <- "<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l, SMOG4)<br />Read</li>" 
#remove the string (this works)
gsub("a></code>(.+?)<br", "a></code><br", x)

#> gsub("a></code>(.+?)<br", "a></code><br", x)
#[1] "<code>(hi)<a href=\"Read\">auto</a></code><br />Read</li>"

#attempt to extract that information (doesn't work)
re <- "(?<=a></code>().*?(?=)<br)"
regmatches(x, gregexpr(re, x, perl=TRUE))

Error message:

> regmatches(x, gregexpr(re, x, perl=TRUE)) 
Error in gregexpr(re, x, perl = TRUE) : 
  invalid regular expression '(?<=a></code>().*?(?=)<br)'
In addition: Warning message:
In gregexpr(re, x, perl = TRUE) : PCRE pattern compilation error
        'lookbehind assertion is not fixed length'
        at ')'

NOTE: Tagged as regex, but this is R-specific regex.

@Ben I edited to say preferred base R so this question is more usable by future searchers. Please add it as a solution. — Tyler Rinker, Feb 08 '13 at 04:03
I know you said base R, but using the `XML` library and its friends `htmlTreeParse` or `xmlTreeParse` might be more appropriate than using regex to deal with HTML code. — thelatemail, Feb 08 '13 at 04:14
@AnandaMahto perfect. Please add it as the solution. That was one attempt I made but I was way off with my `gsub` attempt. — Tyler Rinker, Feb 08 '13 at 04:22

score 8 · Accepted Answer · answered Feb 08 '13 at 04:27

For these types of problems, I would use backreferences to extract the portion I want.

x <- 
  "<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l, SMOG4)<br />Read</li>" 
gsub(".*a></code>(.+?)<br.*", "\\1", x)
# [1] "(coleman_l, SMOG4)"

If the parentheses should also be removed, add them to the "plain text" part that you are trying to match, but remember that they need to be escaped:

gsub(".*a></code>\\((.+?)\\)<br.*", "\\1", x)
# [1] "coleman_l, SMOG4"
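One caveat with the backreference approach is that `gsub` returns its input unchanged when the pattern does not match, so a failed extraction can look like a successful one. The sketch below contrasts that with base R's `regexec` plus `regmatches`, which return the capture groups directly and give an empty result on no match. The string `no_hit` and the variable names are made up for illustration, and the pattern uses `[^)]+` in place of the lazy `.+?`.

```r
x <- "<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l, SMOG4)<br />Read</li>"
no_hit <- "<p>nothing to see here</p>"  # hypothetical input without the target

# gsub() hands back the input untouched when the pattern does not match
unchanged <- gsub(".*a></code>\\((.+?)\\)<br.*", "\\1", no_hit)

# regexec() + regmatches(): element 1 is the full match, element 2 is group 1
pat  <- "a></code>\\(([^)]+)\\)<br"
hit  <- regmatches(x, regexec(pat, x))[[1]]
miss <- regmatches(no_hit, regexec(pat, no_hit))[[1]]

extracted <- hit[2]  # the text between the parentheses
length(miss)         # 0, so a failed match is easy to detect
```

Checking `length(miss)` (or comparing the `gsub` result with its input) is one way to tell a real extraction from a non-match.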

score 7 · Answer 2 · answered Feb 08 '13 at 04:37

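FWIW, the OP's original approach would have worked with a small tweak: escape the literal parentheses. Unescaped, the `(` and `)` are treated as grouping syntax, so the lookbehind only closes after `<br` and the lazy `.*?` ends up inside it, which is why PCRE complains that the lookbehind is not fixed length.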

> x
[1] "<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l, SMOG4)<br />Read</li>"
> re <- "(?<=a></code>\\().*?(?=\\)<br)"
> regmatches(x, gregexpr(re, x, perl=TRUE))
[[1]]
[1] "coleman_l, SMOG4"

An advantage of this approach over the other suggested solutions is that when the string contains several matches, all of them are returned.

> x <- '<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l, SMOG4)<br />Read</li><code>(hi)<a href=\"Read\">auto</a></code>(coleman_l_2, SMOG4_2)<br />Read</li>'
> regmatches(x, gregexpr(re, x, perl=TRUE))
[[1]]
[1] "coleman_l, SMOG4"     "coleman_l_2, SMOG4_2"
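Since the error in the question comes from the fixed-length requirement on lookbehinds, a possible alternative is PCRE's `\K`, which discards everything matched so far from the reported match and has no length restriction. This is only a sketch of that variant, not part of the answer above; `re_k` and `found` are illustrative names.

```r
x <- paste0(
  "<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l, SMOG4)<br />Read</li>",
  "<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l_2, SMOG4_2)<br />Read</li>"
)

# \\K drops the prefix "a></code>(" from the match; the lookahead still
# stops the lazy .*? just before ")<br"
re_k  <- "a></code>\\(\\K.*?(?=\\)<br)"
found <- regmatches(x, gregexpr(re_k, x, perl = TRUE))[[1]]
found
```

As with the lookbehind version, `perl = TRUE` is required, and every occurrence in the string is returned.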

edited Feb 08 '13 at 05:05

answered Feb 08 '13 at 04:37

CHP

Can't you change "re" to `re <- "(?<=a>\\().*?(?=\\)
– A5C1D2H2I1M1N2O1R2T1 Feb 08 '13 at 04:51
I swear i tried that but it didn't work :P... amending my solution. – CHP Feb 08 '13 at 04:55

score 5 · Answer 3 · answered Feb 08 '13 at 04:23

This will work, despite being ugly.

x<-"<code>(hi)<a href=\"Read\">auto</a></code>(coleman_l, SMOG4)<br />Read</li>"

x2 <- gsub("^.+(\\(.+\\)).+\\((.+)\\).+$","\\2",x)
x2
[1] "coleman_l, SMOG4"

thelatemail

Are regexes usually pretty? +1 – A5C1D2H2I1M1N2O1R2T1 Feb 08 '13 at 04:27
Haven't seen a useful and pretty regex :-) thanks for your response. +1 – Tyler Rinker Feb 08 '13 at 04:43
