Recursive Regex in PHP with variable names

Question

I'm trying to build a BBCode-ish engine for my website. The catch is that the set of available codes isn't known in advance, because the codes are defined by the users. On top of that, the whole thing has to be recursive.

For example:

Hello my name is [name user-id="1"]
I [bold]really[/bold] like cheeseburgers

These are the easy ones, and I managed to get them working.

The problem is what happens when two of those codes follow each other:

I [bold]really[/bold] like [bold]cheeseburgers[/bold]

Or are nested inside each other:

I [bold]really like [italic]cheeseburgers[/italic][/bold]

These codes can also have attributes

I [bold strength="600"]really like [text font-size="24px"]cheeseburgers[/text][/bold]

The following pattern worked quite well, but it lacks the recursive part (?R):

(?P<code>\[(?P<code_open>\w+)\s?(?P<attributes>[a-zA-Z-0-1-_=" .]*?)](?:(?P<content>.*?)\[\/(?P<code_close>\w+)\])?)

I just don't know where to put the (?R) recursive call.

Also, the system has to know that this string

I [bold]really like [italic]cheeseburgers[/italic][/bold] and [bold]football[/bold]

contains 2 "code-objects":

1. [bold]really like [italic]cheeseburgers[/italic][/bold]

and

2. [bold]football[/bold]

... and the content of the first one is

really like [italic]cheeseburgers[/italic]

which again has a code in it

[italic]cheeseburgers[/italic]

whose content is

cheeseburgers

I've searched the web for two days now and I can't figure it out.

I thought of something like this:

1. Look for something like [**** attr="foo"], where the attributes are optional, and store it in a capturing group.
2. Check whether there is a closing tag somewhere (it can be optional too).
3. If a closing tag exists, everything between the two tags should be stored as a "content" capturing group, which then has to go through the same procedure again.

I hope there are some regex specialists who are willing to help me. :(

Thank you!

EDIT

As this might be difficult to understand, here is an input and an expected output:

Input:

[heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow]

I'd like to have an array like

array[0][name] = heading
array[0][attributes][icon] = rocket
array[0][content] = I'm a cool heading
array[1][name] = textrow
array[1][content] = [text]<p>Hi!</p>[/text]
array[1][0][name] = text
array[1][0][content] = <p>Hi!</p>

I'd take a look at this thread: http://stackoverflow.com/questions/6773192/recursive-bbcode-parsing. — chris85, Dec 22 '15 at 16:51
chris85, isn't this too simple? I can't just use a simple replace, because for some codes I need to call classes which then have to run some database functions, for example. I need all the data stored in an array. — SunTastic, Dec 22 '15 at 16:54
anubhava, [heading icon="rocket"]I'm a cool heading[/heading][textrow][text]<p>Hi!</p>[/text][/textrow] - here I need an array that says: OK, we have two codes, "heading" and "textrow". The first one has "I'm a cool heading" as content (plus an attribute "icon" which is "rocket"); the second has "[text]<p>Hi!</p>[/text]" inside, which AGAIN has a code "text" inside, with the content "<p>Hi!</p>". So there has to be an array tree which represents the structure. — SunTastic, Dec 22 '15 at 16:56
I've added a concrete example of input and output in the EDIT part of the question — SunTastic, Dec 22 '15 at 17:03
I don't know how to use the `(?R)`, but I'm really curious on how... you can try something with this pattern: [`(?s)\[(?!\/)([^\s\]]+)[^]]*\](.*?)\[\/\1\]`](https://regex101.com/r/qY4qC0/1) — , Dec 22 '15 at 17:33
The problem with using a regex for this type of problem is that this is not what regexes are good at. Regexes are meant to be used with a regular language; what you have is a context-free grammar. CFGs should be parsed using some sort of push-down state machine instead of a regular expression. "every regular language is context-free. The converse is not true: for example the language consisting of all strings having the same number of a's as b's is context-free but not regular." https://en.wikipedia.org/wiki/Regular_language — Scott, Dec 22 '15 at 19:43
That's why I'd recommend using my library: https://github.com/thunderer/Shortcode with provided `RegularParser`. — Tomasz Kowalczyk, Dec 29 '15 at 13:54

score 2 · Answer 1 · answered Dec 22 '15 at 17:06

Having written multiple BBCode parsing systems, I can suggest NOT using regexes only. Instead, you should actually parse the text.

How you do this is up to you, but as a general idea you would want to use something like strpos to locate the first [ in your string, then check what comes after it to see if it looks like a BBCode tag and process it if so. Then, search for [ again starting from where you ended up.
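As a rough illustration of that scan, here is a minimal sketch. The function name `tokenizeCodes`, the token layout, and the assumed tag syntax (`[name key="value"]` and `[/name]`) are my own choices for the example, not something fixed by the question. It walks the string with strpos, checks at each `[` whether a tag starts there, and keeps anything that does not look like a tag as plain text.

```php
<?php
// Sketch: split the input into text tokens and tag tokens.
// Assumption: tags look like [name key="value"] or [/name].
function tokenizeCodes(string $input): array
{
    // \G anchors the match at the offset passed to preg_match().
    $tagPattern = '/\G\[(\/?)([a-zA-Z][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*\]/';
    $tokens = [];
    $pointer = 0;

    while (($bracket = strpos($input, '[', $pointer)) !== false) {
        // Everything before the bracket is plain text.
        if ($bracket > $pointer) {
            $tokens[] = ['type' => 'text', 'value' => substr($input, $pointer, $bracket - $pointer)];
        }
        if (preg_match($tagPattern, $input, $m, 0, $bracket)) {
            // Looks like a code: collect its attributes as key => value.
            preg_match_all('/([\w-]+)="([^"]*)"/', $m[3], $pairs, PREG_SET_ORDER);
            $tokens[] = [
                'type' => $m[1] === '/' ? 'close' : 'open',
                'name' => $m[2],
                'attributes' => array_column($pairs, 2, 1),
            ];
            $pointer = $bracket + strlen($m[0]);
        } else {
            // Not a valid code: keep the "[" as literal text and move on.
            $tokens[] = ['type' => 'text', 'value' => '['];
            $pointer = $bracket + 1;
        }
    }
    // Trailing text after the last bracket.
    if ($pointer < strlen($input)) {
        $tokens[] = ['type' => 'text', 'value' => substr($input, $pointer)];
    }
    return $tokens;
}
```

Calling `tokenizeCodes('Hello [name user-id="1"]')` should give a text token followed by an open token for `name` with its `user-id` attribute. Because the regex only ever has to recognise a single tag at a known offset, it stays small, and nesting can be handled separately in ordinary code.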

This has certain advantages, such as being able to examine each code and skip it if it's invalid, as well as enforcing proper tag closing order ([bold][italic]Nesting![/bold][/italic] should be considered invalid) and being able to provide meaningful error messages to the user if something is wrong (invalid parameter, perhaps) because the parser knows exactly what is going on, whereas a regex would output something unexpected and potentially harmful.
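To make the nesting check concrete, here is one possible sketch using an explicit stack of open codes. It assumes that every code has a closing tag (the question also allows codes without one, which this sketch does not handle), and the function name `parseCodes` and the keys `name`, `attributes`, `content` and `children` are illustrative choices of mine.

```php
<?php
// Sketch only: assumes every code is closed and attributes look like key="value".
function parseCodes(string $input): array
{
    $tagPattern = '/\G\[(\/?)([a-zA-Z][\w-]*)((?:\s+[\w-]+="[^"]*")*)\s*\]/';
    $result = [];   // finished top-level codes
    $stack = [];    // currently open codes, innermost last
    $pointer = 0;

    while (($bracket = strpos($input, '[', $pointer)) !== false) {
        if (!preg_match($tagPattern, $input, $m, 0, $bracket)) {
            $pointer = $bracket + 1;   // a literal "[", not a code
            continue;
        }
        $pointer = $bracket + strlen($m[0]);

        if ($m[1] === '') {
            // Opening tag: remember where its content starts.
            preg_match_all('/([\w-]+)="([^"]*)"/', $m[3], $pairs, PREG_SET_ORDER);
            $stack[] = [
                'name' => $m[2],
                'attributes' => array_column($pairs, 2, 1),
                'contentStart' => $pointer,
                'children' => [],
            ];
            continue;
        }

        // Closing tag: it must match the innermost open code.
        $node = array_pop($stack);
        if ($node === null) {
            throw new InvalidArgumentException("Unexpected [/{$m[2]}] at offset $bracket");
        }
        if ($node['name'] !== $m[2]) {
            throw new InvalidArgumentException(
                "Expected [/{$node['name']}] but found [/{$m[2]}] at offset $bracket"
            );
        }
        $node['content'] = substr($input, $node['contentStart'], $bracket - $node['contentStart']);
        unset($node['contentStart']);

        // Attach the finished code to its parent, or to the top level.
        if ($stack) {
            $stack[count($stack) - 1]['children'][] = $node;
        } else {
            $result[] = $node;
        }
    }

    if ($stack) {
        $open = end($stack);
        throw new InvalidArgumentException("Missing [/{$open['name']}]");
    }
    return $result;
}
```

With the input from the question's EDIT, this should return two top-level entries, `heading` and `textrow`, with `text` as a child of `textrow`. The badly nested `[bold][italic]Nesting![/bold][/italic]` is rejected with an exception that names the tag that was expected.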

It might be more work (or less, depending on your skill with regex), but it's worth it.

I already thought of that, but then I got totally stuck playing around with regex and regex101.com. I really hoped to find a solution with just regex. But you might be right. Do you maybe have a suggestion on how to start implementing my "own parser"? — SunTastic, Dec 22 '15 at 17:11
As a basic idea, you will want `$input = "..."; $pointer = 0; $output = "";`, then you can do something like `while(is_int($bracket = strpos($input,'[',$pointer))) { $output .= substr($input,$pointer,$bracket-$pointer); /* do some regex to get the tag from substr($input,$bracket) and process stuff here - you will need to get the position of the ] in here */ $pointer = $closeBracketPosition; }` - this is just a very basic idea but hopefully it helps. — Niet the Dark Absol, Dec 22 '15 at 17:17
