4

Because of the way that jQuery deals with script tags, I've found it necessary to do some HTML manipulation using regular expressions (yes, I know... not the ideal tool for the job). Unfortunately, it seems like my understanding of how captured groups work in JavaScript is flawed, because when I try this:

var scriptTagFormat = /<script .*?(src="(.*?)")?.*?>(.*?)<\/script>/ig;

html = html.replace(
    scriptTagFormat, 
    '<span class="script-placeholder" style="display:none;" title="$2">$3</span>');

The script tags get replaced with the spans, but the resulting title attribute is blank. Shouldn't $2 match the content of the src attribute of a script tag?

ekad
Jacob

5 Answers

5

Nesting of groups is irrelevant; their numbering is determined strictly by the positions of their opening parentheses within the regex. In your case, that means it's group #1 that captures the whole src="value" sequence, and group #2 that captures just the value part.
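A minimal sketch of that numbering rule, using a stripped-down version of the pattern and a made-up attribute string (the variable names and `app.js` are illustrative, not from the question):

```javascript
// Groups are numbered by where their opening parenthesis appears, left to right.
// The outer group opens first, so it is group 1; the nested group is group 2.
const srcPattern = /(src="(.*?)")/;
const numbered = 'src="app.js"'.match(srcPattern);

const wholeAttribute = numbered[1]; // outer group: the whole src="..." sequence
const valueOnly = numbered[2];      // nested group: just the value

console.log(wholeAttribute, valueOnly);
```

So `$2` is the right reference for the value; the numbering itself is not what produces the blank title.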

Alan Moore
1

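The problem is that the group following the first .*? is optional. The lazy .*? starts by matching nothing, the optional group is then skipped whenever src is not at that exact position, and your src ends up being consumed by the later .*? instead. If you remove the ? after your first group, it works.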
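This can be checked directly with the regex from the question. The sketch below drops the `g` flag so that `match()` returns the groups; the two sample tags are made up for illustration:

```javascript
// The regex from the question, without the g flag so match() exposes the groups.
const questionPattern = /<script .*?(src="(.*?)")?.*?>(.*?)<\/script>/i;

// src directly after "<script ": the optional group matches.
const srcFirst = '<script src="app.js"></script>'.match(questionPattern);

// Another attribute in front of src: the optional group is skipped,
// and the later .*? consumes the src attribute on its way to ">".
const srcSecond = '<script type="text/javascript" src="app.js"></script>'.match(questionPattern);

console.log(srcFirst[2]);  // app.js
console.log(srcSecond[2]); // undefined
```

The overall match still succeeds in the second case, which is why the script tag gets replaced but the title comes out blank.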

Update: As @morja pointed out, your solution is to move the first .*? into the optional src part.

Just for completeness: /<script (?:.*?(src="(.*?)"))?.*?>(.*?)<\/script>/ig

You can see it here on rubular (corrected my link also)

If you don't want to use the content of the first capturing group, make it a non-capturing group using (?:).

/<script (?:.*?(?:src="(.*?)"))?.*?>(.*?)<\/script>/ig

Then the values you want are in $1 and $2.
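A short usage sketch with the replacement string from the question, adjusted to the new group numbers. The two input tags are invented for the example:

```javascript
// Non-capturing variant: $1 is the src value, $2 is the script body.
const fixedPattern = /<script (?:.*?(?:src="(.*?)"))?.*?>(.*?)<\/script>/ig;
const placeholder =
    '<span class="script-placeholder" style="display:none;" title="$1">$2</span>';

const external = '<script type="text/javascript" src="app.js">init();</script>';
const inline = '<script type="text/javascript">init();</script>';

// With src present, the title carries its value.
const externalResult = external.replace(fixedPattern, placeholder);
// Without src, the optional group is unmatched and $1 becomes an empty string.
const inlineResult = inline.replace(fixedPattern, placeholder);

console.log(externalResult);
console.log(inlineResult);
```

One caveat worth testing against real input: `.` also matches `>`, so when the HTML contains several script tags, the `.*?` inside the optional part can reach past the end of a tag that has no src and pick up the src of a later one.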

stema
  • I just want to capture the `src` attribute of a `script` tag, if it exists, no matter where it is in the tag. – Jacob May 05 '11 at 21:59
1

Try this:

/<script (?:(?!src).)*(?:src="(.*?)")?.*?>(.*?)<\/script>/ig

See here: rubular

As stema wrote, the src ends up inside one of the .*? parts instead of the group. With the negative lookahead (?:(?!src).)* the match stops just before the src attribute, so the optional group gets a chance to capture it.

But actually in this case you could also just move the .*? into the optional part:

/<script (?:.*?src="(.*?)")?.*?>(.*?)<\/script>/ig

See here: rubular
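For comparison, a small sketch that runs both patterns against the same made-up tag (without the `g` flag, so the groups are visible through `match()`):

```javascript
// Variant 1: consume characters only while "src" does not start at the current position.
const lookaheadPattern = /<script (?:(?!src).)*(?:src="(.*?)")?.*?>(.*?)<\/script>/i;
// Variant 2: the .*? moved inside the optional part.
const movedPattern = /<script (?:.*?src="(.*?)")?.*?>(.*?)<\/script>/i;

const sample = '<script type="text/javascript" src="app.js">init();</script>';

const viaLookahead = sample.match(lookaheadPattern);
const viaMoved = sample.match(movedPattern);

console.log(viaLookahead[1], viaLookahead[2]); // app.js init();
console.log(viaMoved[1], viaMoved[2]);         // app.js init();
```

On this input both give the src value in group 1 and the script body in group 2.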

morja
0

Could you post the html you are retrieving? Your code works fine in a simple example: jsfiddle (warning: alert box)

My first guess is that one of your script tags does not have a src, meaning you are left with a single matched capture group (the script contents).
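A quick way to check that guess, sketched with the regex and replacement string from the question and an invented inline script tag:

```javascript
// The regex and replacement string from the question.
const askedPattern = /<script .*?(src="(.*?)")?.*?>(.*?)<\/script>/ig;
const askedReplacement =
    '<span class="script-placeholder" style="display:none;" title="$2">$3</span>';

// An inline script with no src attribute (made up for this example).
const inlineTag = '<script type="text/javascript">alert(1);</script>';

// A group that did not take part in the match is substituted as an empty string.
const replaced = inlineTag.replace(askedPattern, askedReplacement);

console.log(replaced);
```

The title comes out empty here, the same symptom as in the question, so the input HTML is needed to tell the two causes apart.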

WSkid
  • Interesting... if you put a `type="text/javascript"` in front of the `src` attribute, you'll see the problem. It looks like it might not be the groups that are the problem but rather the way that non-greedy captures work. – Jacob May 05 '11 at 20:57
0

I'm thinking that regular expressions by themselves can't do exactly what I'm looking for, so here's my modification to work around the problem:

var scriptTagFormat = /<script\s+((.*?)="(.*?)")*\s*>(.*?)<\/script>/ig;

html = html.replace(
    scriptTagFormat, 
    '<span class="script-placeholder" style="display:none;" $1>$4</span>');

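Before, I wanted to avoid setting non-standard attributes on the replacement span. This code copies the attribute text captured in $1 onto the span instead. Note that a repeated capturing group keeps only its last iteration, so when a tag has several attributes only the final one is copied. Luckily, the non-standard attributes aren't stripped out of the DOM when I insert the HTML, so it will work for my purposes.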
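To see what `$1` holds when a tag has more than one attribute, here is a minimal check on a made-up tag. The second pattern, `allAttrsPattern`, is not from the code above; it is an assumed variant that wraps the repetition in an outer group so that the whole attribute list is captured:

```javascript
// The workaround pattern, without the g flag so match() exposes the groups.
const workaroundPattern = /<script\s+((.*?)="(.*?)")*\s*>(.*?)<\/script>/i;
const twoAttrs = '<script type="text/javascript" src="app.js">init();</script>';

// A repeated capturing group keeps only its last iteration.
const lastOnly = twoAttrs.match(workaroundPattern);
console.log(JSON.stringify(lastOnly[1])); // " src=\"app.js\""

// Hypothetical variant: capture the repetition as a whole in one outer group.
const allAttrsPattern = /<script\s+((?:.*?=".*?")*)\s*>(.*?)<\/script>/i;
const allAttrs = twoAttrs.match(allAttrsPattern);
console.log(allAttrs[1]); // type="text/javascript" src="app.js"
```

With the variant, the replacement string would use `$1` for the attributes and `$2` for the script body. Both patterns were only traced here against a single tag; since `.` can match `>`, input with several script tags should be tested separately.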

Jacob