
RegEx:

<span style='.+?'>TheTextToFind</span>

HTML:

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span></span>

Why does the match include this?

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED


zzzzBov
bradvido
  • What language are you writing the regex in? More than likely there's an HTML parsing library for that language which would be [a more appropriate way of solving whatever problem you're actually having](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) – zzzzBov Feb 12 '14 at 16:15
  • VBScript is what I'm using the regex in – bradvido Feb 12 '14 at 16:38
  • This is just a quick and dirty regex to find some data. Obviously, actually parsing the HTML is a better solution – bradvido Feb 12 '14 at 17:37

1 Answer


The regex engine always finds the leftmost match, that is, the match that starts at the earliest possible position in the input. That's why you get

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span>

as a match. (Basically the whole input, sans the last </span>).
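The behaviour can be reproduced with a short sketch. Python's `re` module is used here purely for illustration (the question is about VBScript, and the variable names `html` and `lazy` are made up for this example); it assumes the engine is a backtracking one that supports lazy quantifiers.

```python
import re

# The input from the question, split over two lines for readability.
html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# The original pattern with the lazy quantifier.
lazy = re.compile(r"<span style='.+?'>TheTextToFind</span>")

m = lazy.search(html)
print(m.start())   # 0: the match begins at the outer <span>
print(m.group())   # everything except the final </span>
```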

To steer the engine in the correct direction, if we assume that > doesn't appear directly in the attribute, the following regex will match what you want.

<span style='[^>]+'>TheTextToFind</span>

This regex matches what you want, since with the above assumption, [^>]+ can't match outside a tag.
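As a quick check of the revised pattern, here is a minimal sketch in Python (an illustration only, not the asker's VBScript; `html` and `fixed` are names chosen for this example). It relies on the same assumption as above, namely that `>` never appears inside the attribute value.

```python
import re

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# [^>]+ cannot cross the '>' that closes the outer opening tag,
# so the attempt starting at the outer <span> fails and the engine moves on.
fixed = re.compile(r"<span style='[^>]+'>TheTextToFind</span>")

m = fixed.search(html)
print(m.group())   # <span style='font-size:18.0pt;'>TheTextToFind</span>
```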

However, I hope that you are not doing this as part of a program that extracts information from an HTML page. Use an HTML parser for that purpose.
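For comparison, here is one way the parser-based approach might look, sketched with Python's standard `html.parser` module. The class name `SpanFinder` and its attributes are hypothetical, and a VBScript solution would use a different library; the point is only that a parser tracks tag nesting, so the text is attributed to the innermost `<span>`.

```python
from html.parser import HTMLParser


class SpanFinder(HTMLParser):
    """Records the style of the innermost open <span> around the target text."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.open_spans = []   # style attribute of each currently open <span>
        self.found = []        # styles of spans that directly contain the target

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_spans.append(dict(attrs).get("style"))

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        if self.target in data and self.open_spans:
            self.found.append(self.open_spans[-1])


html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

finder = SpanFinder("TheTextToFind")
finder.feed(html)
finder.close()
print(finder.found)   # ['font-size:18.0pt;']
```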


To understand why the regex matches as it does, you need to understand that .+? will backtrack, consuming one more character at a time, until the rest of the pattern (the sequel, '>TheTextToFind</span>) can match.

# Matching .+?
# Since +? is lazy, it matches . once (to fulfill the minimum repetition), and
# increases the number of repetitions each time the sequel fails to match
<span style='f                        # FAIL. Can't match closing '
<span style='fo                       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;        # PROCEED. But FAIL later, since can't match T in The
<span style='font-size:11.0pt;'       # FAIL. Can't match closing '
...
<span style='font-size:11.0pt;'>DON   # PROCEED. But FAIL later, since can't match closing >
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style=
                                      # PROCEED. But FAIL later, since can't match closing >
...
<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;
                                      # PROCEED. MATCH FOUND.

As you can see, .+? is tried with increasing lengths until it has consumed font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;, at which point the rest of the pattern, '>TheTextToFind</span>, can finally match.
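The text consumed by the lazy quantifier can be inspected directly by wrapping it in a capture group. The sketch below uses Python for illustration; the capture group is an addition for this example and is not part of the original pattern.

```python
import re

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# Same pattern as in the question, with .+? wrapped in a capture group.
m = re.search(r"<span style='(.+?)'>TheTextToFind</span>", html)

# Group 1 spans from the first style value through the second one.
print(m.group(1))
# font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;
```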

nhahtdh
  • I still don't get the "left most match". Can you elaborate on how it works? – bradvido Feb 12 '14 at 16:50
  • @bradvido: Logically, you can think of the engine as starting at index i and exhaustively searching for a substring starting at index i that matches the regex (the regex determines the search order). Only if no match is found there does it try again at the next index, i+1. In this case, since it can find a match for the regex starting at index 0, it returns that match. – nhahtdh Feb 12 '14 at 16:56
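The search order described in the last comment can be emulated by hand. This is a conceptual sketch in Python, not how any particular engine is implemented; `leftmost_search` is a hypothetical helper that makes an anchored attempt at each start index in turn and returns the first success.

```python
import re


def leftmost_search(pattern, text):
    """Try the pattern at index 0, then 1, and so on; return the first match."""
    for start in range(len(text) + 1):
        m = pattern.match(text, start)   # anchored attempt at this index only
        if m:
            return m
    return None


html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

lazy = re.compile(r"<span style='.+?'>TheTextToFind</span>")
fixed = re.compile(r"<span style='[^>]+'>TheTextToFind</span>")

# The lazy pattern already succeeds at index 0, so later indices are never tried.
print(leftmost_search(lazy, html).start())   # 0

# The [^>]+ pattern fails at index 0 and first succeeds at the inner <span>.
print(leftmost_search(fixed, html).start() == html.index("<span", 1))   # True
```

Laziness only affects how far a quantifier extends once a start position has been chosen; it does not make the engine prefer a later, shorter match.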