
I wrote this regex to match strings:

(?>(?<Quote>""|').*?(?<!\\)\k<Quote>)

i.e., some text enclosed in quotes. It also supports escaping, so it matches "hello\"world" in its entirety without stopping at the first inner quote, which is what I want. But I forgot about double-escaping. In "hello\\"world", for example, the backslash is itself escaped, so the string should end at the second quote, yet my regex matches the whole thing.
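
To make the failure concrete, here is a rough reproduction in Python's `re` module rather than .NET. It rests on some assumptions: `(?<Quote>...)` and `\k<Quote>` are translated to `(?P<Quote>...)` and `(?P=Quote)`, the doubled `""` from the verbatim string is written as a single `"`, and the atomic group is dropped. The names `NAIVE` and `first_match` are made up for this sketch.

```python
import re

# Python translation of the pattern above (assumed equivalent for these inputs).
# The look-behind only checks ONE character before the closing quote.
NAIVE = re.compile(r"""(?P<Quote>"|').*?(?<!\\)(?P=Quote)""")


def first_match(text):
    """Return the first quoted string the naive pattern finds, or None."""
    m = NAIVE.search(text)
    return m.group(0) if m else None


single_escape = r'"hello\"world"'    # the backslash escapes the inner quote
double_escape = r'"hello\\"world"'   # the backslash is itself escaped

print(first_match(single_escape))    # whole input, as intended
print(first_match(double_escape))    # also the whole input, which is the bug
```

In the second case the match should stop after `"hello\\"`, but the look-behind sees a backslash directly before the quote and skips it.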

I'm pretty sure this is possible to fix with balancing groups, but I've never really used them before. Anyone know how to write this?

mpen

1 Answer


Regular expressions are not meant to be used for escaped constructs.

I don't think it's possible to do this in any "nice" kind of way (if at all), although I'll post an edit if I figure out otherwise.

Balancing group definitions are for nested constructs. Nesting doesn't happen in strings, so balancing group definitions don't seem to even be the right tool for this.


Edit 1:

It depends on how many features you're looking for. If you simply want to match a quoted string that may contain escaped quotes, you can use the pattern

^"([^\\\"]|\\.)*"

which, when escaped for code, turns out like

"^\"([^\\\\\\\"]|\\\\.)*\""

to match something like

"Hello! \" Hi! \" "

but as soon as you start adding more complicated requirements like Unicode escapes, it becomes a lot more tedious. At that point, just parse it by hand; it should be much simpler.
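
As a rough idea of what doing it by hand could look like, here is a minimal Python sketch of a scanner. The function name `scan_string` and its return convention are assumptions made for this example, and it treats a backslash as escaping exactly one following character, nothing more elaborate.

```python
def scan_string(text, start=0):
    """Return the index just past the closing quote of the string that opens
    at text[start], or -1 if the string is never closed."""
    quote = text[start]
    if quote not in "\"'":
        raise ValueError("expected a quote character at position %d" % start)
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2          # skip the backslash and the character it escapes
        elif ch == quote:
            return i + 1    # unescaped closing quote
        else:
            i += 1
    return -1               # ran off the end without a closing quote


sample = r'"hello\\"world"'
end = scan_string(sample)
print(sample[:end])
```

Longer escapes (Unicode and the like) would go in the backslash branch, which is where a hand-written loop tends to stay readable while a regex does not.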


Edit 2:

If you're curious about how balancing group definitions work anyway, I recommend reading page 430 of this book (34 in pdf).

user541686
  • Took me a few minutes to realize you escaped that regex for use in code...that just makes it confusing. Should be `"((?>[^\\\"]|\\.)*)"`. Why can balancing groups only be used for nesting constructs? We want to *balance* the number of slashes, no? Final quote should be preceded by an even number of slashes, or *not* preceded by an odd number. I don't want to do it behind because this is part of a much larger regex. – mpen May 11 '11 at 03:41
  • @Mark: Yeah sorry, that was escaped. :\ (**Edit:** I made it a bit more readable.) You don't need any balancing here, because there's nothing you're trying to "balance" -- all you're saying is, "a string is a sequence of non-escape characters and/or escaped characters", which is all done in a single pass. There's no need for anything to be balanced anywhere. (I'm not sure what you mean by the even/odd slashes... could you explain what you mean?) – user541686 May 11 '11 at 03:46
  • Actually I think it only needs 2 slashes in the `[]`. By even/odd... I mean `"hello\"world"` should match (1 slash), but `"hello\\\"world"` should also match (3 slashes). The 1st slash escapes the 2nd, and the 3rd escapes the quote. So if there's an odd number of slashes before the quote, then it is considered escaped. – mpen May 11 '11 at 03:55
  • Also, I think it's much easier to read with the @-quote notation: `@"""((?>[^\\""]|\\.)*)"""`. Well... slightly easier. This is a bit of an unfortunate example because it deals with quotes as well, but at least the slashes aren't ridiculous. – mpen May 11 '11 at 03:57
  • @Mark: Yeah they both worked, I wasn't sure if `"` was special or not so I escaped it. Regarding the even/odd: it actually *does* do what you want! You just forgot to put the entire thing in quotes (`"\"hello\\\"world\""`). And yeah, I tried using verbatim strings but they weren't too much better, whatever. (Btw, those are backslashes, not slashes. :P) – user541686 May 11 '11 at 04:01
  • Backslashes are a type of slash! And yes, I wasn't saying your solution was wrong :) Seems to be working in all my tests. I'm trying to get as close as possible to this definition http://www.w3.org/TR/CSS21/syndata.html#strings I suspect slashes and newlines will be pretty rare so I'm not overly concerned, but the closer I can get the better. – mpen May 11 '11 at 04:04
  • Is the `?>` necessary? When would backtracking ever cause it to fail? Or is it an efficiency thing? – mpen May 11 '11 at 04:14
  • @Mark: The `?>` was an efficiency thing, not for correctness. :) If it fails any test lemme know! – user541686 May 11 '11 at 05:01
  • @Mehrdad: Will do. I updated it to `"((?>[^\\"\r\n]|\\\r\n|\\.)*)"|'((?>[^\\'\r\n]|\\\r\n|\\.)*)'` which *should* handle newlines according to the spec as well. i.e., they're not allowed unless escaped. – mpen May 11 '11 at 05:38
  • @Mark: Cool! Like I said, though, if you run into anything more complicated (say, `\x0123`) then it'll get more and more tedious, you might just want to do it by hand. :) – user541686 May 11 '11 at 05:41
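
The even/odd behaviour discussed in the comments can be checked with a short script. This is a sketch in Python's `re`, not .NET: the atomic group `(?>...)` is written as a plain non-capturing group `(?:...)` so it runs on Python versions before 3.11, which per the comments above should only affect efficiency. The names `STRING`, `CSS_STRING` and `matched` are made up for the example.

```python
import re

# Pattern from the comments: a string is a run of non-escape characters
# and/or backslash-escaped characters.
STRING = re.compile(r'"((?:[^\\"]|\\.)*)"')

# The later variant: raw newlines are rejected, backslash + CRLF is accepted.
CSS_STRING = re.compile(
    r'"((?:[^\\"\r\n]|\\\r\n|\\.)*)"'
    r"|'((?:[^\\'\r\n]|\\\r\n|\\.)*)'"
)


def matched(pattern, text):
    """Return the text matched at the start of `text`, or None."""
    m = pattern.match(text)
    return m.group(0) if m else None


one = r'"hello\"world"'       # 1 backslash: the quote is escaped
two = r'"hello\\"world"'      # 2 backslashes: the quote closes the string
three = r'"hello\\\"world"'   # 3 backslashes: the quote is escaped again

for s in (one, two, three):
    print(s, "->", matched(STRING, s))

print(matched(CSS_STRING, '"a\nb"'))              # raw newline inside: no match
print(repr(matched(CSS_STRING, '"a\\\r\nb"')))    # escaped CRLF: matches
```

Because each backslash is consumed together with the character after it, the parity of the backslashes falls out of a single left-to-right pass, with no counting or balancing needed.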