
I would like to split a string in PHP containing quoted and unquoted substrings.
Let's say I have the following string:

"this is a string" cat dog "cow"  

The resulting array should look like this:

array (  
[0] => "this is a string"  
[1] => "cat"  
[2] => "dog"  
[3] => "cow"  
)

I'm struggling a bit with regex, and I'm wondering whether this can be achieved with a single regex / preg_split call at all.

The first thing I tried was:

[[:blank:]]*(?=(?:[^"]*"[^"]*")*[^"]*$)[[:blank:]]*

But this only splits array[0] and array[3] correctly; the rest is split character by character.

Then I found this link:
PHP preg_split with two delimiters unless a delimiter is within quotes

(?=(?:[^"]*"[^"]*")*[^"]*$)

This looks like a good starting point to me. However, the result for my example is the same as with the first regex.

I tried combining both: first the one for quoted strings, and then a second sub-regex that should omit quoted strings (hence the [^"]):

(?=(?:[^"]*"[^"]*")*[^"]*$)|[[:blank:]]*([^"].*[^"])[[:blank:]]*

So I have two questions:

  1. Is it even possible to achieve what I want with a single regex / preg_split call?
  2. If yes, I would appreciate a hint on how to assemble the regex correctly.
Stefan

1 Answer


Since matches cannot overlap, you could use preg_match_all like this:

preg_match_all('/"[^"]*"|\S+/', $input, $matches);

Now $matches[0] should contain what you are looking for. At each position the regex first tries to match a complete quoted string; if that fails, it collects as many non-whitespace characters as possible. Since alternatives are tried from left to right, the quoted version takes precedence.
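A minimal sketch of this first variant, run against the sample string from the question (the variable names are only illustrative):

```php
<?php
// Sample input taken from the question.
$input = '"this is a string" cat dog "cow"';

// Alternative 1: a complete quoted string. Alternative 2: a run of non-whitespace.
// The quoted alternative is listed first, so it wins wherever both could start.
preg_match_all('/"[^"]*"|\S+/', $input, $matches);

// $matches[0] holds the full matches, quotes included.
print_r($matches[0]);
```

The quoted tokens keep their quotes here, because $matches[0] always holds the complete match.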

EDIT: This will not get rid of the quotes though. To do this, you could use capturing groups:

preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $matches);

Now $matches[1] will contain exactly what you are looking for. The (?| starts a branch reset group, so the capturing groups in both alternatives share the same index (1).
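As a small illustration, here is the capturing variant applied to the reordered string from the comments below (variable names are again just illustrative):

```php
<?php
// Input with the quoted parts at the end instead of the start.
$input = 'cat dog "this is a string" "cow"';

// (?| ... ) is a branch reset group: the capture in either alternative is group 1.
preg_match_all('/(?|"([^"]*)"|(\S+))/', $input, $matches);

// $matches[0] still contains the quotes; $matches[1] has them stripped.
print_r($matches[1]);
```

Without the branch reset, the quoted text would land in group 1 and the unquoted words in group 2, so the tokens would have to be merged from two arrays afterwards.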

EDIT 2: Since you were asking for a preg_split solution, that is also possible. We can use a lookahead that asserts that the whitespace is followed by an even number of quotes (up to the end of the string):

$result = preg_split('/\s+(?=(?:[^"]*"[^"]*")*$)/', $input);

Of course, this will not get rid of the quotes, but that can easily be done in a separate step.
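A rough sketch of the split plus the separate quote-stripping step, using trim as one possible way to remove the quotes. One case that seems worth checking is an input that ends with an unquoted word: the lookahead above only reaches the end of the string directly after a closing quote, so the sketch also shows a variant with [^"]* before the $, the same tail the regexes in the question use. The variable names are illustrative.

```php
<?php
// Pattern from above: whitespace followed by an even number of quotes.
$pattern = '/\s+(?=(?:[^"]*"[^"]*")*$)/';

$tokens = preg_split($pattern, '"this is a string" cat dog "cow"');

// Separate step: strip the surrounding quotes from each token.
$tokens = array_map(function ($t) { return trim($t, '"'); }, $tokens);
print_r($tokens);

// Input ending in an unquoted word: the lookahead cannot reach the end
// of the string, so the last space is not treated as a split point.
$unsplit = preg_split($pattern, '"this is a string" cat');
print_r($unsplit);

// Variant that also allows trailing non-quote characters before the end.
$variant = '/\s+(?=(?:[^"]*"[^"]*")*[^"]*$)/';
$split = preg_split($variant, '"this is a string" cat');
print_r($split);
```

Note that trim removes every leading and trailing quote character, which is fine for tokens like these but would also strip quotes from a token that consists only of quotes.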

Martin Ender
  • Thanks a lot! As I need this for MATCH-AGAINST validation in MySQL, preg_split('/\s+(?=(?:[^"]*"[^"]*")*$)/', $input) is exactly what I was looking for. – Stefan Nov 08 '12 at 17:54
  • I just did some further testing: the preg_split regex stops working as soon as the input string is altered like this: cat dog "this is a string" "cow". However, "/(?|"([^"]*)"|(\S+))/" with preg_match_all does the job. – Stefan Nov 08 '12 at 18:39
  • @Stefan Huh, that's odd. The split solution works with that string for me, too. But I'm glad you could get the other one to work. – Martin Ender Nov 08 '12 at 22:15