5

I'm trying to extract information from rc-files. In these files, " characters in strings are escaped by doubling them (""), analogous to C# verbatim strings. Is there a way to extract the string?

For example, if I have the following string "this is a ""test""" I would like to obtain this is a ""test"". It also must be non-greedy (very important).

I've tried to use the following regular expression:

"(?<text>[^""]*(""(.|""|[^"])*)*)"

However, the performance was awful. I've based it on the explanation here: http://ad.hominem.org/log/2005/05/quoted_strings.php

Does anybody have an idea how to handle this with a regular expression?

Martin Ender
MartenBE
  • And you are trying to do that using Perl? – Martin Ender Nov 21 '12 at 14:34
  • No, I'm using C#. (I understand Perl and use it a lot, but I cannot use it for this application) – MartenBE Nov 21 '12 at 14:38
  • Then why does this have a Perl tag? :D – Martin Ender Nov 21 '12 at 14:41
  • Because I thought it had something to do with perl regular expressions. I'm sorry if it caused any misunderstanding. – MartenBE Nov 21 '12 at 14:45
  • .NET has its own regex engine (which is in fact a lot more powerful than Perl's). – Martin Ender Nov 21 '12 at 14:46
  • 1
    @m.buettner "*which is in fact a lot more powerful than Perl's*" - There you go, trying to start a fight. – tylerl Nov 21 '12 at 14:50
  • @tylerl :D ... okay, probably not more powerful in terms of the (theoretical) languages it can match... since I guess balancing groups and recursion might be equally "powerful". But in terms of convenient features definitely. First and foremost, variable-length lookbehinds. And then it is the only engine that allows capturing of an arbitrary number of groups. That allows uses of the engine, which are simply not possible with a single regex in any other engine. – Martin Ender Nov 21 '12 at 14:52
  • @tylerl thinking about it... balancing groups **might** be more powerful (in theoretical terms) than recursion, since you have multiple stacks. I think in .NET you could match something like `(a) (b) (c) abc` (for an arbitrary amount of characters), while with recursion you could only do that for `(a) (b) (c) cba` – Martin Ender Nov 21 '12 at 15:00

5 Answers

5

You've got some nested repetition quantifiers there. Those can cause catastrophic backtracking, which ruins performance.

Try something like this:

(?<=")(?:[^"]|"")*(?=")

The repeated group can now only consume either two quotes at once or a single non-quote character. The lookbehind and lookahead assert that the actual match is preceded and followed by a quote.

This also gets you around having to capture anything. Your desired result will simply be the full string you want (without the outer quotes).

I do not assert that the outer quotes are not doubled, because if they were, there would be no way to distinguish them from an empty string anyway.
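A minimal sketch of the lookaround pattern applied to the question's example. It uses Python's `re` module rather than .NET purely for illustration; the helper name `inner_text` is hypothetical, and the sketch assumes the input holds a single literal.

```python
import re

# Pattern from the answer: the lookarounds require a quote on each side,
# and the body consumes either one non-quote character or a doubled quote.
INNER = re.compile(r'(?<=")(?:[^"]|"")*(?=")')

def inner_text(literal):
    """Return the text between the outer quotes of one literal, or None."""
    match = INNER.search(literal)
    return match.group(0) if match else None

print(inner_text('"this is a ""test"""'))  # this is a ""test""
```

Because the lookarounds do not consume the outer quotes, running this pattern over a line with several literals can also match the text between one literal's closing quote and the next literal's opening quote, so it is safest on one literal at a time.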

Martin Ender
2

This turns out to be a lot simpler than you'd expect. A string literal with escaped quotes looks exactly like a bunch of simple string literals run together:

"Some ""escaped"" quotes"

"Some " + "escaped" + " quotes"

So this is all you need to match it:

(?:"[^"]*")+

You'll have to strip off the leading and trailing quotes in a separate step, but that's not a big deal. You would need a separate step anyway, to unescape the escaped quotes (\" or "").
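The two steps described here (match, then strip and unescape) could be sketched as follows. This is an illustration in Python's `re` module, not .NET code; `extract_strings` is a hypothetical helper name, and it assumes `""` is the only escape in use.

```python
import re

# One or more simple "..." runs back to back form a single literal.
QUOTED = re.compile(r'(?:"[^"]*")+')

def extract_strings(line):
    """Return the unescaped contents of every quoted literal in a line."""
    results = []
    for match in QUOTED.finditer(line):
        inner = match.group(0)[1:-1]            # drop the outer quotes
        results.append(inner.replace('""', '"'))  # undo the doubling
    return results

print(extract_strings('CAPTION "this is a ""test""", "x"'))
# ['this is a "test"', 'x']
```

Since the pattern has no nested quantifiers and each alternative starts with a different character, it should not suffer from the backtracking problem in the original attempt. Two literals written with no separator between them would be read as one, which is the same ambiguity the format itself has.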

Alan Moore
0

Don't know if this is better or worse than m.buettner's (guessing not - he seems to know his stuff) but I thought I'd throw it out there for critique.

"(([^"]+(""[^"]+"")*)*)"
garyh
  • 1
    I think this has the same problem as the OP's attempt. But if you remove the second `""` and the last `*`, it should be pretty much as good as mine (even better, as it implements the "Unrolling-the-loop" optimization technique (@ridgerunner will sing a song about it, if he sees this answer :D)). However, the `+`s require at least one non-quote character between the double quotes. You should probably make these `*`, too. (i.e. `"([^"]*(""[^"]*)*)"`) – Martin Ender Nov 21 '12 at 15:09
0

Try this: (?<=^")(.*?"{2}.*?"{2})(?="$). It may be faster than the two previous answers, and without any bugs.

og Grand
  • "without any bugs", quite a claim, don't you think? ;) ... this will gladly match `"something"here"then""this""and"so"on"""`, but neither `"something"`, `"some""thing"`, nor `"some""thing""like""this"` (the latter because it does not end with a triple quote, which is required by your regex) – Martin Ender Nov 21 '12 at 15:36
0
  • Match a " beginning the string
  • Multiple times match a non-" or two "
  • Match a " ending the string

"([^"]|(""))*?"
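A quick check of how the quantifier's laziness affects this pattern, sketched in Python's `re` module for illustration (the names `LAZY`, `GREEDY` and `first_match` are hypothetical). With the lazy `*?`, the closing quote is tried before the loop body at every step, so the match ends at the first quote it meets, even if that quote is half of a doubled pair.

```python
import re

LAZY = re.compile(r'"([^"]|(""))*?"')    # pattern as written above
GREEDY = re.compile(r'"([^"]|(""))*"')   # same pattern, greedy quantifier

def first_match(pattern, text):
    """Return the first full match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

sample = '"this is a ""test"""'
print(first_match(LAZY, sample))    # "this is a "
print(first_match(GREEDY, sample))  # "this is a ""test"""
```

The greedy form still stops at the end of the literal, because a lone quote matches neither `[^"]` nor `""`, so it seems to meet the intent behind the "non-greedy" requirement without the lazy quantifier.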

Carlo V. Dango