Reg-Ex for filtering out, parsing and replacing a specific string? (php)

Question

While developing a private CMS for a client, I had the idea of implementing a flexible, server-side template "language" built on top of PHP.

I'm having trouble writing a regular expression that finds (filters out) the following string, including its line breaks. Here [..] stands for the code, which will be parsed once it has been filtered out:

<(
    [..]
)>

I was looking for a solution all night, but I didn't find one.

If you're unable to write your own regular expressions, are you sure you should be writing your own languages, which if you want to make them more than simple string placeholders, can't be parsed by regular expressions anyway? Maybe you should be using someone else's template system in your CMS. Markdown, Smarty, something. People have been writing template languages for decades, you don't have to reinvent the wheel. — Dan Grossman, Jan 22 '11 at 10:40
It's not like I'm not able to write regular expressions. It's like I'm not able to be parsing a string with that format and line-breaks in the *php*-format. — 19h, Jan 22 '11 at 11:22
so - you want to get verbatim content, everything in between "<(" and ")>" ? No special rules on comments or escaping? — foo, Jan 22 '11 at 11:27
Not at all. It's probably been a mistake of mine writing the post here. I'm sorry for the inconvenience :( — 19h, Jan 22 '11 at 11:56

score 1 · Accepted Answer · edited May 23 '17 at 09:58

First off: listen to Dan Grossman's advice above.

From my current understanding of your question, you want to get the verbatim content between <( and )> - no exceptions, no comment handling.

If so, try this RegExp:

'/<\(((?:.|\s)*?)\)>/'

which you can use like this:

preg_match_all('/<\(((?:.|\s)*?)\)>/', $yourstring, $matches);

It doesn't need case insensitivity, and it does lazy matching (so you can apply it to a string with several instances of matches).

Explanation of the RegExp: it starts with <( and ends with )> (parentheses escaped, of course); in between is the capturing group. At its core, we take either a regular character . or whitespace \s (which solves your problem: by default . does not match line breaks, but line breaks are whitespace, so \s does). We don't want to capture every single character separately, so the inner group is non-capturing, just either whitespace or character: (?:.|\s). This is repeated any number of times (including zero), but only until the first match is complete: *? for lazy 0-n. That's about it, hope it helps.
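
To make the explanation above concrete, here is a minimal sketch of how the pattern behaves on a string with two multi-line blocks. The sample text in $template and the use of trim() are assumptions for illustration only; they are not part of the original question.

```php
<?php
// Hypothetical template text; each <( ... )> block spans several lines.
$template = "Hello <(\n    echo 'a';\n)> and <(\n    echo 'b';\n)> done";

// Pattern from the answer: lazily match any character or whitespace
// between the escaped delimiters <( and )>.
$pattern = '/<\(((?:.|\s)*?)\)>/';

// preg_match_all() returns the number of full matches found.
$count = preg_match_all($pattern, $template, $matches);

// $matches[0] holds the full matches including the delimiters,
// $matches[1] holds the captured content between them.
$blocks = array_map('trim', $matches[1]);

print_r($blocks);
```

Because the quantifier is lazy, each block is matched separately instead of one match running from the first <( to the last )>.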

answered Jan 22 '11 at 10:50

foo


-1, consider: `` and your suggestion matches the very first `` in the string and then consumes everything until it hits the end-of-string, after which it backtracks to the last `` (so if there are more than 2 ``'s in the file, you're in trouble). You should have stopped typing after the first sentence. :) – Bart Kiers Jan 22 '11 at 10:59
It's not intended to parse an entire file, comments or anything generic - the very use of two `` implies to me the content is chopped up before this RegExp will be applied. Otherwise, it would be ambiguous which content is to be taken if there are several ``. In the second paragraph of my answer, I talk about this. In order to construct a safe, repeatable RegExp, there should be more information about the format. – foo Jan 22 '11 at 11:02
I agree: there should be more information about the format before being able to give a proper answer (so, IMO, you should have waited with an answer as well). As it is now, it still deserves a -1 IMO. – Bart Kiers Jan 22 '11 at 11:04
Maybe it's best to not be thinking of the code being dependent on the ``'s. Check out the edit of the post. – 19h Jan 22 '11 at 11:24
Thanks, I did and revised the RegExp. – foo Jan 22 '11 at 11:33
Thank you! :) This is even pleasant to implement ;) – 19h Jan 22 '11 at 11:57
@foo: That `(?:.|\s)*?` is going to get you in big trouble one of these days; see this answer for the reason why: http://stackoverflow.com/questions/2407870/javascript-regex-hangs-using-v8/2408599#2408599 The correct way to match anything-including-newlines is to use the dot in single-line a.k.a. dot-matches-all mode. This is usually done by adding the `s` modifier to the end of the regex. By the way, the `m` modifier in your regex serves no purpose. – Alan Moore Jan 29 '11 at 07:59
For this specific question, a catastrophic backtrack is quite unlikely, but thanks for pointing it out for the general case. You're correct about the /m, it's a leftover from the answer to the first version of the question. edited it out. – foo Feb 19 '11 at 02:14
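
For reference, the dot-matches-all variant that Alan Moore describes could look roughly like the sketch below. The sample $input and the strtoupper() callback are stand-ins chosen for illustration (the real CMS would do its own parsing there); only the `s` modifier itself comes from the comment above.

```php
<?php
// Hypothetical input with two multi-line blocks.
$input = "a <(\n  one\n)> b <(\n  two\n)> c";

// With the s (dot-matches-all) modifier the dot also matches newlines,
// so the (?:.|\s) alternation is no longer needed.
$pattern = '/<\((.*?)\)>/s';

preg_match_all($pattern, $input, $matches);
$blocks = array_map('trim', $matches[1]);

// Replacing each block: the callback receives the match array and
// returns the replacement text. strtoupper() is only a placeholder
// for whatever processing the captured code would really get.
$output = preg_replace_callback(
    $pattern,
    function (array $m): string {
        return strtoupper(trim($m[1]));
    },
    $input
);

echo $output, "\n";
```

The lazy `.*?` keeps each block separate, just as in the original pattern, while avoiding the alternation inside the repeated group.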
