I'm trying to write a PHP regex that extracts the text inside brackets from a string while ignoring any nested brackets:

Let's say I want

Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] Aenean massa[3. Lorem ipsum] dolor.

to return

[1] => "dolor sit amet, [consectetuer adipiscing] elit."
[2] => "Dolor, [consectetuer adipiscing] elit."
[3] => "Lorem ipsum"

So far I have

'/\[([0-9]+)\.\s([^\]]+)\]/gi'

but it breaks when nested brackets occur.

How can I keep the inner brackets from being detected as separate matches? Thanks in advance!

hm711
  • Because of the nested structure, I believe a regex is not well suited to this case. A simple routine may be a better approach. – someOne Sep 30 '15 at 08:28
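
The "simple routine" suggested in the comment above could look roughly like the following sketch. It is not from the thread: the function name `extractNumberedItems` is hypothetical, and the code assumes that only top-level groups starting with digits, a period and one whitespace character are wanted. It walks the string once and tracks the bracket nesting depth instead of relying on a recursive pattern.

```php
<?php
// Hypothetical helper (not part of the original thread): collects top-level
// "[N. text]" groups by counting bracket depth rather than using recursion.
function extractNumberedItems(string $text): array
{
    $items = [];
    $depth = 0;     // current bracket nesting level
    $start = null;  // offset just after the opening top-level '['

    $length = strlen($text);
    for ($i = 0; $i < $length; $i++) {
        $char = $text[$i];
        if ($char === '[') {
            if ($depth === 0) {
                $start = $i + 1;
            }
            $depth++;
        } elseif ($char === ']' && $depth > 0) {
            $depth--;
            if ($depth === 0) {
                $content = substr($text, $start, $i - $start);
                // Keep only groups that begin with "<digits>. ".
                if (preg_match('/^(\d+)\.\s(.*)$/s', $content, $m)) {
                    $items[(int) $m[1]] = $m[2];
                }
            }
        }
    }

    return $items;
}

$text = 'Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. '
      . 'Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] '
      . 'Aenean massa[3. Lorem ipsum] dolor.';

print_r(extractNumberedItems($text));
```

Unbalanced closing brackets at depth zero are simply skipped here, and an unclosed opening bracket yields no item; whether that is the right behaviour depends on the input you expect.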

3 Answers

You can use recursive references to previous groups:

(?<no_brackets>[^\[\]]*){0}(?<balanced_brackets>\[\g<no_brackets>\]|\[(?:\g<no_brackets>\g<balanced_brackets>\g<no_brackets>)*\])

The idea is to define the desired match as either a pair of brackets enclosing text with no brackets, or a pair of brackets enclosing a sequence of bracket-free text and balanced bracket groups, where those inner groups are matched recursively by the same rule.
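
As a rough illustration of how this pattern might be used from PHP, here is a minimal sketch. It assumes that PHP's PCRE engine accepts the `\g<name>` subroutine-call syntax and the `{0}` "define only" trick, and the `~` delimiter and variable names are my own choices. Note that this pattern returns each top-level group with its brackets and item number still attached, so a second step would be needed to strip them.

```php
<?php
$text = 'Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. '
      . 'Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] '
      . 'Aenean massa[3. Lorem ipsum] dolor.';

// The pattern from the answer, split across lines only for readability.
// no_brackets is defined with {0}, so it never matches on its own and is
// only reached through the \g<no_brackets> subroutine calls.
$pattern = '~(?<no_brackets>[^\[\]]*){0}'
         . '(?<balanced_brackets>\[\g<no_brackets>\]'
         . '|\[(?:\g<no_brackets>\g<balanced_brackets>\g<no_brackets>)*\])~';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full top-level bracket groups, brackets included.
print_r($matches[0]);
```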

ndnenkov

You can use this pattern that captures the item number and the following text in two different groups. If you are sure all item numbers are unique, you can build the associative array described in your question with a simple array_combine:

$pattern = '~\[ (?:(\d+)\.\s)? ( [^][]*+ (?:(?R) [^][]*)*+ ) ]~x';

if (preg_match_all($pattern, $text, $matches))
    $result =  array_combine($matches[1], $matches[2]);

Pattern details:

~     # pattern delimiter
\[    # literal opening square bracket
(?:(\d+)\.\s)? # optional item number (*) 
(              # capture group 2
   [^][]*+         # all that is not a square bracket (possessive quantifier)
   (?:             # non-capturing group
       (?R)        # recursion: (?R) is an alias for the whole pattern
       [^][]*      # all that is not a square bracket
   )*+             # repeat zero or more times (possessive quantifier)
)
]                  # literal closing square bracket
~x  # free spacing mode

(*) Note that the item number part must be optional if you want to use recursion with (?R), because a nested group such as [consectetuer adipiscing] has no item number. This becomes a problem if you want to skip top-level square brackets that lack an item number. In that case you can build a more robust pattern by changing the optional group (?:(\d+)\.\s)? into a conditional statement: (?(R)|(\d+)\.\s)

Conditional statement:

(?(R)        # IF you are in a recursion
             # THEN match this (nothing in our case)
  |          # ELSE
  (\d+)\.\s  #   match the item number, a period and a whitespace character
)

In this way the item number becomes mandatory at the top level, while nested brackets matched through the recursion still need none.
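
Putting the conditional into the full pattern might look like the sketch below. The sample text is the one from the question with an extra unnumbered group appended (my addition, to show that it is skipped), and the variable names are only illustrative.

```php
<?php
$text = 'Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. '
      . 'Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] '
      . 'Aenean massa[3. Lorem ipsum] dolor. [no item number here]';

// Same pattern as before, with the optional group replaced by the
// conditional: outside a recursion the item number is required,
// inside a recursion nothing is matched at that position.
$pattern = '~\[ (?(R)|(\d+)\.\s) ( [^][]*+ (?:(?R) [^][]*)*+ ) ]~x';

$result = [];
if (preg_match_all($pattern, $text, $matches)) {
    // Group 1 holds the item numbers, group 2 the text that follows them.
    $result = array_combine($matches[1], $matches[2]);
}

print_r($result);
```

With this input the result should contain three entries keyed 1, 2 and 3, and the trailing unnumbered group should not appear.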

Casimir et Hippolyte

You can use a recursive regex to obtain all the substrings enclosed in square brackets, and then use a preg_replace inside an array_map to remove the item number and the enclosing brackets:

$str = "Lorem ipsum [1. dolor sit amet, [consectetuer adipiscing] elit.]. Aenean commodo ligula eget dolor.[2. Dolor, [consectetuer adipiscing] elit.] Aenean massa[3. Lorem ipsum] dolor.";
// Match every top-level [...] group, nested brackets included, via recursion.
preg_match_all('/\[(?>[^\[\]]|(?R))*]/', $str, $matches);
// Strip the enclosing brackets and the leading item number from each match.
$res = array_map(function ($el) {
    return preg_replace('/^\[\d+\.\s*(.*?)\s*\]$/s', '$1', $el);
}, $matches[0]);
print_r($res);

The \[(?>[^\[\]]|(?R))*] regex matches [, then any number of characters other than [ and ] or nested [...] constructs, and finally the closing ]. See more about recursion with regex at regular-expressions.info.

The regex inside the preg_replace, ^\[\d+\.\s*(.*?)\s*\]$, matches the initial [ followed by one or more digits, a period and any whitespace, then matches and captures the rest up to the optional trailing whitespace (\s*) and the closing ] (the $ ensures that this bracket is the one at the end of the string). The $1 replacement keeps only the captured text, which is used to populate the new array.

Wiktor Stribiżew