I'm trying to build a RegEx string for use in a find and replace in Sublime Text or Notepad++ to remove strikethrough text from an HTML page. In general, the strikethrough is formatted as follows:

<span style="color: rgb(255,0,0);"><s>Some text here</s></span>

So far, I've come up with this:

<span.*<s>.*<\/s><\/span>

But it doesn't stop at the first </span>; it carries on, so I get a huge slab of text selected. I've had a look at the regex wiki (and several other resources), and I'm sure this is a "greedy matches" issue, but I can't get my head around what the fix should look like.

Edit: I'm not set on RegEx, by the way. If anyone has a better way to achieve what I'm after, I'm all ears.

Andy Lester
Geoff
  • Possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – MaxZoom Sep 28 '16 at 01:47
  • @maxzoom: He's *not* trying to parse an HTML document, though. – Daniel McLaury Sep 28 '16 at 02:00

2 Answers

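The best way to limit a greedy match is to make it stop at a specific character. [abc] is a character class meaning any of a, b, c, while [^abc] means anything but a, b, c. So [^<] means anything but <, and [^>] means anything but >. In the pattern below, [^>]* cannot run past the end of the opening <span ...> tag, and [^<]* cannot run past the start of the next tag.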

<span[^>]*><s>[^<]*</s></span>

The other (often slower) way is to make the * or + quantifier lazy, so that it returns the shortest match instead of the longest. In Perl-compatible regex, you do this with *? or +?.
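To see the difference between the three patterns outside an editor, here is a minimal sketch using Python's `re` module. The sample `html` string and the `strip_matches` helper are made up for illustration, and Python's regex flavor is only assumed to behave like the editors' engines for these simple patterns.

```python
import re

# Hypothetical sample: two struck-through spans with text to keep between them.
html = (
    '<p>keep A <span style="color: rgb(255,0,0);"><s>gone 1</s></span> keep B '
    '<span style="color: rgb(255,0,0);"><s>gone 2</s></span> keep C</p>'
)

greedy = r'<span.*<s>.*</s></span>'            # the pattern from the question
negated = r'<span[^>]*><s>[^<]*</s></span>'    # stop at a specific character
lazy = r'<span[^>]*><s>.*?</s></span>'         # shortest match with *?


def strip_matches(pattern, text):
    """Remove every match of pattern from text (replace with nothing)."""
    return re.sub(pattern, '', text)


# The greedy pattern runs from the first <span to the last </s></span>,
# so "keep B" is swallowed along with both spans.
print(strip_matches(greedy, html))

# The other two remove each span separately and leave "keep B" in place.
print(strip_matches(negated, html))
print(strip_matches(lazy, html))
```

In the editor, the equivalent is to put the pattern in the Find field with regular expression mode enabled and leave the Replace field empty.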

dwks
  • Works perfectly, a million thankyous! Will mark this as the correct answer when the time limit allows :) – Geoff Sep 28 '16 at 01:50
  • Your first regex won't work if there are other tags in the strikeout, e.g. `<s>first <b>second</b> third</s>`. (The second strategy will work in this case.) – Daniel McLaury Sep 28 '16 at 01:51
  • True, if there are other nested tags then it may be better to use `.*?` between the strikeout tags. – dwks Sep 28 '16 at 01:53

To expand on dwks's answer and the comments on it, if there are any HTML tags at all inside the struck-through text, e.g. if it looks like

<span><s>first <b>second</b> third</s></span>

then it won't match the regex

<span[^>]*><s>[^<]*</s></span>

since this regex does not allow a < between the <s> and </s>. The end of that answer mentions that you can use *? instead. For the sake of completeness, that regex would look something like this:

<span[^>]*><s>.*?<\/s><\/span>
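As a quick check of the nested-tag case, here is a small sketch in Python's `re` module. The `nested` sample string is invented for illustration. The `re.DOTALL` flag is an addition of mine so that `.` also matches line breaks in Python; the editors have their own setting for that, and it only matters when the struck-through text spans several lines.

```python
import re

# Hypothetical sample with a <b> tag inside the struck-through text.
nested = '<p>before <span><s>first <b>second</b> third</s></span> after</p>'

negated = re.compile(r'<span[^>]*><s>[^<]*</s></span>')
# re.DOTALL lets "." match newlines too (an assumption about the input).
lazy = re.compile(r'<span[^>]*><s>.*?</s></span>', re.DOTALL)

# [^<]* stops at the "<" of <b>, where the pattern then requires </s>,
# so the character-class version finds nothing to remove.
print(negated.search(nested))

# The lazy version expands until the first </s></span> and removes the span.
print(lazy.sub('', nested))
```

One caveat with the lazy version: if a span contains an <s> element followed by other content before its </span>, the match can run on to a later </s></span>, so it is worth eyeballing the matches before replacing them all.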
Daniel McLaury