Here's an attempt with preg_match:
$pattern = "/^([^\[]+)\[([^\]]+)\]\s+\(([^,]+),\s+([^,]+),\s+([^,]+),\s+([^,]+)\)\s+(.+)$/i";
$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";
preg_match($pattern, $string, $keywords);
array_shift($keywords);
print_r($keywords);
Output:
Array
(
[0] => CADAVRES
[1] => FILM
[2] => Canada : Québec
[3] => Érik Canuel
[4] => 2009
[5] => long métrage
[6] => FICTION
)
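One detail the printed output hides: groups such as `[^\[]+` also capture the whitespace that precedes the next delimiter, so the first field is really `"CADAVRES "` with a trailing space. Below is a minimal sketch of a wrapper that trims each field and reports a failed match. The function name `parse_record` is made up for illustration; the pattern is the same one as above, written in single quotes.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the seven
// trimmed fields, or null when the line does not fit the expected layout.
function parse_record(string $line): ?array
{
    $pattern = '/^([^\[]+)\[([^\]]+)\]\s+\(([^,]+),\s+([^,]+),\s+([^,]+),\s+([^,]+)\)\s+(.+)$/i';

    // preg_match returns 1 on a match, 0 on no match, false on error.
    if (preg_match($pattern, $line, $m) !== 1) {
        return null;
    }

    array_shift($m);              // drop the full-match entry
    return array_map('trim', $m); // strip whitespace captured around each field
}

print_r(parse_record("CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION"));
var_dump(parse_record("no brackets here")); // NULL
```

Checking the return value matters because `$keywords` would otherwise be an empty array for lines that do not match.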
Regex breakdown:
^ anchor to start of string
( begin capture group 1
[^\[]+ one or more non-left bracket characters
) end capture group 1
\[ literal left bracket
( begin capture group 2
[^\]]+ one or more non-right bracket characters
) end capture group 2
\] literal right bracket
\s+ one or more whitespace characters
\( literal open parenthesis
( begin capture group 3
[^,]+ one or more non-comma characters
) end capture group 3
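,\s+ literal comma followed by one or more whitespace characters
([^,]+),\s+([^,]+),\s+([^,]+) capture groups 4, 5 and 6, each built like group 3 and separated by the same comma-plus-whitespace delimiter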
\) literal closing parenthesis
\s+ one or more whitespace characters
( begin capture group 7
.+ everything else
) end capture group 7
$ anchor to end of string
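The breakdown above can also be written into the pattern itself. The following is a sketch, not the original code: it uses PCRE's `x` flag (whitespace and `#` comments inside the pattern are ignored) together with named groups. The group names `title`, `type`, `origin`, `director`, `year`, `format` and `genre` are guesses at what the fields mean and are purely illustrative.

```php
<?php
// Same structure as the one-line pattern, spread out with the x flag.
// Group names are assumptions made for readability.
$pattern = '/
    ^  (?<title>    [^\[]+ )       # everything before the first "["
    \[ (?<type>     [^\]]+ ) \]    # bracketed field
    \s+ \(
       (?<origin>   [^,]+ ) ,\s+   # four comma-separated fields
       (?<director> [^,]+ ) ,\s+
       (?<year>     [^,]+ ) ,\s+
       (?<format>   [^,]+ )
    \) \s+
       (?<genre>    .+ ) $         # everything else
/x';

$string = "CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009, long métrage) FICTION";

if (preg_match($pattern, $string, $m) === 1) {
    // preg_match stores each named group under both its name and its number;
    // keep only the string keys, then trim stray whitespace from the values.
    $fields = array_map('trim', array_filter($m, 'is_string', ARRAY_FILTER_USE_KEY));
    print_r($fields);
}
```

Accessing `$fields['year']` is easier to read than `$keywords[4]`, at the cost of a longer pattern.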
This assumes your structure is static and is not particularly pretty, but on the other hand it should be robust to delimiters creeping into fields where they're not supposed to be. A title containing a : or , seems plausible, and it would break a "split on these delimiters anywhere"-type solution. For example,
"Matrix:, Trilogy() [FILM, reviewed: good] (Canada() : Québec , \t Érik Canuel , ): 2009 , long ():():[][]métrage) FICTIO , [(:N";
correctly parses as:
Array
(
[0] => Matrix:, Trilogy()
[1] => FILM, reviewed: good
[2] => Canada() : Québec
[3] => Érik Canuel
[4] => ): 2009
[5] => long ():():[][]métrage
[6] => FICTIO , [(:N
)
Additionally, if the comma-separated region inside the parentheses can contain a variable number of fields, you might want to extract that region first and parse it on its own, then handle the rest of the string.
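A rough sketch of that two-step idea follows, under a few assumptions: the function name `parse_variable_record` and the keys of the returned array are hypothetical, and the parenthesized fields are assumed not to contain commas themselves, since this version splits on every comma inside the parentheses.

```php
<?php
// Hypothetical helper: capture the parenthesized region as one blob,
// then split that blob on commas.
function parse_variable_record(string $line): ?array
{
    // Step 1: title, bracketed type, parenthesized blob, trailing text.
    $outer = '/^([^\[]+)\[([^\]]+)\]\s+\((.+)\)\s+(.+)$/';
    if (preg_match($outer, $line, $m) !== 1) {
        return null; // line does not fit the expected layout
    }

    // Step 2: split the blob on commas, swallowing whitespace around them.
    $details = preg_split('/\s*,\s*/', trim($m[3]));

    return [
        'title'   => trim($m[1]),
        'type'    => trim($m[2]),
        'details' => $details,
        'rest'    => trim($m[4]),
    ];
}

print_r(parse_variable_record("CADAVRES [FILM] (Canada : Québec, Érik Canuel, 2009) FICTION"));
```

Because `(.+)` is greedy, the blob runs up to the last `)` that is followed by whitespace and more text, so a closing parenthesis followed by whitespace in the trailing field would move that boundary.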