
I am working on a regex for matching all quotes (double " and single ' alike), which must have specific characters in front of them and will end upon reaching the same quote type or when encountering an HTML comment (<!--).

The rules of the game are:

  1. The HTML tag names themselves (e.g. "<a>") do not matter; the regex only takes the input from the attributes of the HTML element (<a all="of this in bold is the attribute section">)
  2. The regex must be able to find both single quotes (') and double quotes ("), but only escape upon reaching its own quote type (\1), an HTML comment (<!--) or end of input ($).
  3. If an HTML comment is encountered the quote will be interrupted, but is still considered a quote: <a id="works <!-- interrupted -->
  4. Only a specific set of characters must exist before the first quote, following this pattern: [^\w!#£¤€´¨\-.:]

See this regex:

/[^\w!#£¤€´¨\-.:]('|")(.|\n)*?(\1|<!--|$)/

There's a problem in the DISALLOW area (at regexpal), though. The regex should never match there, because the character in front of the first quote is disallowed.
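The unwanted match can be reproduced outside regexpal. The snippet below is a minimal sketch that runs the regex from the question against one line of the DISALLOW list; the variable names are made up for the demo.

```javascript
// The pattern from the question, unchanged.
const originalPattern = /[^\w!#£¤€´¨\-.:]('|")(.|\n)*?(\1|<!--|$)/;

// One line from the DISALLOW list: the first quote follows "a",
// which the negated class is supposed to reject.
const disallowedLine = "<a a'' <!-- -->";

const unwanted = originalPattern.exec(disallowedLine);

// The match does not start at "a". It starts at the FIRST quote (index 4),
// which the negated class accepts as the character in front of the
// SECOND quote. The second quote then opens a "quote" that runs to <!--.
console.log(unwanted.index);              // 4
console.log(JSON.stringify(unwanted[0])); // "'' <!--"
```

So the regex does skip the disallowed character, but then treats the first quote as an allowed prefix for the second one.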

Thanks in advance for your help!

Clarification

Example here on regexpal.com. Everything - except the content under the DISALLOW section - is correct.

The desired result should be as follows. Bold indicates a match using the regex displayed above. The (many) HTML comments are there to end the HTML tags in a consistent way.

ALLOW

  • <a '' <!-- -->
  • <a $'' <!-- -->
  • <a %'' <!-- -->
  • <a &'' <!-- -->
  • <a /'' <!-- -->
  • <a ('' <!-- -->
  • <a )'' <!-- -->
  • <a {'' <!-- -->
  • <a }'' <!-- -->
  • <a ['' <!-- -->
  • <a ]'' <!-- -->
  • <a ='' <!-- -->
  • <a ?'' <!-- -->
  • <a +'' <!-- -->
  • <a `'' <!-- -->
  • <a |'' <!-- -->
  • <a ^'' <!-- -->
  • <a ~'' <!-- -->
  • <a *'' <!-- -->
  • <a ,'' <!-- -->
  • <a ;'' <!-- -->
  • <a <'' <!-- -->
  • <a \'' <!-- -->

DISALLOW

  • <a a'' <!-- -->
  • <a 9'' <!-- -->
  • <a !'' <!-- -->
  • <a #'' <!-- -->
  • <a £'' <!-- -->
  • <a ¤'' <!-- -->
  • <a €'' <!-- -->
  • <a ´'' <!-- -->
  • <a ¨'' <!-- -->
  • <a -'' <!-- -->
  • <a _'' <!-- -->
  • <a .'' <!-- -->
  • <a :'' <!-- -->

WITH BOTH QUOTE TYPES

  • <a single ='hey' double ="you" <!-- -->

STOP AT HTML COMMENT

  • <a =' <!-- this will break both the quotation and the HTML tag -->

END OF INPUT

<a ='

this - on a new line - is still part of the quote

Richard JP Le Guen
Kafoso
  • use negative lookbehind: `(?<![\w!#£¤€´¨\-.:])` – hoaz Dec 07 '12 at 15:13
  • As you know: http://stackoverflow.com/a/1732454/15394 - so parse your html as html and then extract the data you are looking for by picking up the attribute values from the tags and matching only those. Much more resilient. – glenatron Dec 07 '12 at 15:19
  • @hoaz - Javascript doesn't support lookbehinds. – Andrew Cheong Dec 07 '12 at 15:19
  • http://regexr.com?332oa What exactly is your question? I've copied into RegExr, and I changed one thing (I used a lookahead for the html comment, instead of capturing it), but I don't see what the problem is... Though if you're using bold to indicate a match, then you have some inconsistent behaviour - why is all= matched, but id= is not? – FrankieTheKneeMan Dec 07 '12 at 15:49
  • @glenatron: I am not trying to extract the values within the id, class, style, title or other attributes. That is very straightforward, especially when using jQuery. – Kafoso Dec 07 '12 at 16:33
  • @FrankieTheKneeMan: I will update the question with the desired behavior. Although, it is clearly demonstrated in the example above on regexpal.com. – Kafoso Dec 07 '12 at 16:33
  • I have no idea what you're asking for. – Richard JP Le Guen Dec 07 '12 at 16:51
  • "If an HTML comment is encountered the quote will be interrupted" - I don't believe HTML comments can be in attribute values ( http://jsfiddle.net/N9vkG/ ) so an HTML parser wouldn't "interrupt" the "quote". – Richard JP Le Guen Dec 07 '12 at 16:55

1 Answer


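I got it. Naturally, the quote characters themselves must also be excluded from the set of characters allowed in front of the opening quote. Otherwise, in a line such as <a a'', the first quote counts as a valid preceding character for the second one, and the match starts there.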

/[^\w!#£¤€´¨\-.:'"]('|")(.|\n)*?(\1|<!--|$)/
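As a quick check, here is a minimal sketch that runs this regex against a few of the sample lines from the question. The helper `firstMatch` is a hypothetical name used only for this demo.

```javascript
// The pattern from this answer: ' and " are added to the negated class,
// so a quote can no longer act as the allowed character in front of
// the next quote.
const fixedPattern = /[^\w!#£¤€´¨\-.:'"]('|")(.|\n)*?(\1|<!--|$)/;

// Hypothetical helper for this demo: returns the matched text, or null.
function firstMatch(text) {
  const m = fixedPattern.exec(text);
  return m ? m[0] : null;
}

// DISALLOW: no match any more.
console.log(firstMatch("<a a'' <!-- -->")); // null

// ALLOW: "$" may precede the quote, and the quote closes on its own type.
console.log(firstMatch("<a $'' <!-- -->")); // $''

// Both quote types: the first match is the single-quoted value.
console.log(firstMatch("<a single ='hey' double =\"you\" <!-- -->")); // ='hey'

// An HTML comment interrupts an unterminated quote.
console.log(firstMatch("<a =' <!-- this will break -->")); // =' <!--
```

Note that each match includes the character in front of the opening quote, because that character is consumed by the negated class rather than checked with a lookbehind. The quote type itself is available in capture group 1.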
Kafoso