4

Following on from my previous question about preg_split, which was answered super fast (thanks to nick), I would really like to extend the scenario so that the string is not split when a delimiter is within quotes. For example:

If I have the string foo = bar AND bar=foo OR foobar="foo bar", I'd like to split the string on every space or = character and include the = character in the returned array (which works great currently), but I don't want to split the string if either of the delimiters is within quotes.

I've got this so far:

<!doctype html>
<?php

$string = 'foo = bar AND bar=foo';

$array = preg_split('/ +|(=)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

?>
<pre>
<?php

print_r($array);

?>
</pre>

Which gets me:

Array
(
    [0] => foo
    [1] => =
    [2] => bar
    [3] => AND
    [4] => bar
    [5] => =
    [6] => foo
)

But if I changed the string to:

$string = 'foo = bar AND bar=foo OR foobar = "foo bar"';

I'd really like the array to be:

Array
(
    [0] => foo
    [1] => =
    [2] => bar
    [3] => AND
    [4] => bar
    [5] => =
    [6] => foo
    [7] => OR
    [8] => foobar
    [9] => =
    [10] => "foo bar"
)

Notice the "foo bar" wasn't split on the space because it's in quotes?

Really not sure how to do this within the RegEx or if there is even a better way but all your help would be very much appreciated!

Thank you all in advance!

Jonathon Oates

3 Answers

6

Try

$array = preg_split('/(?: +|(=))(?=(?:[^"]*"[^"]*")*[^"]*$)/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

The

(?=(?:[^"]*"[^"]*")*[^"]*$)

part is a lookahead assertion making sure that there is an even number of quote characters ahead in the string, therefore it will fail if the current position is between quotes:

(?=      # Assert that the following can be matched:
 (?:     # A group containing...
  [^"]*" #  any number of non-quote characters followed by one quote
  [^"]*" #  the same (to ensure an even number of quotes)
 )*      # ...repeated zero or more times,
 [^"]*   # followed by any number of non-quotes
 $       # until the end of the string
)
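
To see the lookahead in action, here is a minimal sketch that runs the pattern above against the example string from the question. The variable name `$pattern` is only illustrative, and the sketch assumes, as the answer does, that the input contains no escaped quotes.

```php
<?php
// Example string from the question; the quoted part must stay in one piece.
$string = 'foo = bar AND bar=foo OR foobar = "foo bar"';

// Split on runs of spaces or on "=" (captured so it is kept in the result),
// but only where an even number of quotes remains ahead in the string.
$pattern = '/(?: +|(=))(?=(?:[^"]*"[^"]*")*[^"]*$)/';

$array = preg_split($pattern, $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($array);
```

The space inside "foo bar" has exactly one quote after it, so the lookahead fails there and no split happens at that position.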
Tim Pietzcker
  • Not the OP, but trying to understand this. The idea is that if there are not an even number of quote characters, you're currently in the middle of a quoted section and should not be splitting, right? – KRyan Aug 08 '12 at 21:19
  • 1
    @DragoonWraith: Right. I have assumed that we're not expecting any escaped quotes in our strings. Those could be worked into the regex, too, but I didn't want to make this more complicated than necessary. – Tim Pietzcker Aug 08 '12 at 21:20
  • Excellent, thank you. Very nice; I was all set to comment that I didn't think RegEx could handle this. I never would have thought to use lookahead for an even number of quotes to ensure we're not in a quoted section. – KRyan Aug 08 '12 at 21:21
2

I was able to do this by adding quoted strings as a delimiter, à la:

"(.*?)"| +|(=)

The quoted part will be captured. It seems like this is a bit tenuous and I did not test it extensively, but it at least works on your example.
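
As a rough sketch of how this plugs into the preg_split call from the question (same flags; the variable names `$stripped` and `$kept` are only illustrative): the first pattern captures the text between the quotes, so the quotes are dropped, while the second moves the parentheses outside the quotes so they are kept. Like the original, this assumes quoted strings stay on one line and contain no escaped quotes.

```php
<?php
$string = 'foo = bar AND bar=foo OR foobar = "foo bar"';
$flags  = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

// Quoted strings act as delimiters; only the text inside the quotes is captured.
$stripped = preg_split('/"(.*?)"| +|(=)/', $string, -1, $flags);

// Same idea with the parentheses outside the quotes, so the quotes are kept.
$kept = preg_split('/(".*?")| +|(=)/', $string, -1, $flags);

print_r($stripped); // last element is the text without its quotes
print_r($kept);     // last element still includes the quotes
```

Because the quoted alternative comes first, the regex engine consumes a whole quoted string before it gets a chance to match the space inside it.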

Explosion Pills
  • Good idea. This should work unless quoted strings span multiple lines. – Tim Pietzcker Aug 08 '12 at 21:22
  • Awesome, I've added a single-quote check too [`'/"(.*?)"|(=)|\'(.*?)\'| +/'`] - this exactly fits the bill of what I needed. However, for others looking for a similar answer, this method strips the quotes, while Tim's keeps them in. This way works best for me but Tim's way is exceptional too! Thank you both! – Jonathon Oates Aug 08 '12 at 21:45
  • @JonathonDavidOates if you want to keep the quotes just put the parentheses outside of the quotes (e.g. `(".*?")`). I thought your sample array left them off but I see that it doesn't. – Explosion Pills Aug 08 '12 at 21:58
0

But why bother splitting?

After a look at this old question, this simple solution comes to mind, using preg_match_all rather than preg_split. We can use this simple regex to specify what we want:

"[^"]*"|\b\w+\b|=

See online demo.
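
A minimal sketch of the matching approach, using the example string from the question (the variable names `$matches` and `$tokens` are only illustrative). Note that `\w+` only covers letters, digits and underscores, so this sketch assumes unquoted tokens contain nothing else; characters outside the three alternatives are silently skipped.

```php
<?php
$string = 'foo = bar AND bar=foo OR foobar = "foo bar"';

// Match the tokens we want instead of describing the delimiters:
// a quoted string, a whole word, or a single "=".
preg_match_all('/"[^"]*"|\b\w+\b|=/', $string, $matches);

// With no capturing groups, the full matches are in $matches[0].
$tokens = $matches[0];

print_r($tokens);
```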

zx81