
I am using the preg_split function in PHP in order to create one array containing several different elements. However, I want to exclude a string which happens to contain one of the elements that I'm preg_splitting by.

$array['stuff'] = preg_split('/\[#]|\ &amp  |\ &amp |\&amp |\&amp|\ &amp|\ &gt  |\ &gt |\&gt |\&gt|\ &gt|\ &  |\ & |\& |\&|\ &|\ \/  |\ \/ |\\/ |\\/|\ \/|\ >  |\ > |\> |\>|\ >|\ ,  |\ , |\, |\,|\, |\ ::  |\ :: |\:: |\ ::|\::|\ ::|\ :  |\ : |\: |\:|\ :|\ -  |\ - |\- |\-|\ -/', $array['stuff'] ) ;

What I would like to do is to exclude a string such as 'foo-bar' from being matched for a split because it contains a dash. 'foo-bar' would need to be an exact match for my purposes.

Veger
Tony

2 Answers


The resulting regular expression would be very complicated, especially if you have a lot of exceptions like 'foo-bar'.

You should use a conditional subpattern with a lookbehind as condition and a lookahead as its yes-pattern:

$res = preg_split('/(?(?<=foo)\-(?!bar)|\-)/', 'aasdf-fafsdf-foo-bar-asdf' );
var_dump( $res );

result:

array(4) {
  [0]=>
  string(5) "aasdf"
  [1]=>
  string(6) "fafsdf"
  [2]=>
  string(7) "foo-bar"
  [3]=>
  string(4) "asdf"
}

Let me explain what is happening here. \- means

Match any dash character.

but what we want is

Match any dash character that is not part of foo-bar.

Since we can't express that directly in regex, we rephrase it a little:

Match any dash character that, if preceded by foo, is not followed by bar.

To implement the if part we use a conditional subpattern, this is the syntax:

(?(condition)yes-pattern|no-pattern)

Our "condition" would be "preceded by foo". To check for that we use a lookbehind:

(?<=foo)

If that is true, we should look for "a dash that is not followed by bar". To do that we use a negative lookahead:

\-(?!bar)

And that becomes our "yes-pattern". Our "no-pattern" should be \- or "any dash". The complete regex would be:

(?(?<=foo)\-(?!bar)|\-)

UPDATE: to incorporate this into your current regex, change this part at the end:

|\ -  |\ - |\- |\-|\ -/

to

|\s?(?(?<=foo)\-(?!bar)|\-)\s?/
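As a rough sketch of where this update could lead, the whole pattern from the question might be collapsed into one alternation wrapped in optional whitespace. The delimiter list comes from the question; the `\s{0,2}` quantifier (for "one or two spaces"), the ordering of the alternatives, and the sample input are assumptions made for illustration only.

```php
<?php
// Delimiters from the question. Longer alternatives come first so that
// "&amp" is tried before "&" and "::" before ":".
// The last alternative is the conditional subpattern that protects foo-bar.
$pattern = '/\s{0,2}(?:\[#\]|&amp|&gt|&|\/|>|,|::|:|(?(?<=foo)-(?!bar)|-))\s{0,2}/';

// Hypothetical sample input, not taken from the original question.
$input = 'alpha - beta::gamma, foo-bar & delta';

$parts = preg_split($pattern, $input);
var_export($parts);
```

With this input the dash in "alpha - beta" is treated as a delimiter, while the dash inside foo-bar is left alone.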
nobody
  • I'm not sure how this particular example would incorporate into the original preg_split conditions. I'm attempting various ways on my server right now and I can't find one which produces the desired result. Thanks for the response. – Tony Aug 06 '11 at 16:17
  • @Tony BTW, the way you are detecting whitespace right now is very inefficient; do it like this: `/\s?(?:&amp|&gt|\/|\?|\:\:|\:|\-)\s?/` – nobody Aug 06 '11 at 17:48
  • Thanks for the help. It is much appreciated. Your 'foo-bar' code works perfectly. Now, the reason why I'm doing preg_split in this manner is that there is sometimes white space (one or two spaces) before or after the delimiters, and sometimes there is none. The delimiters, totaling 10 thus far, are: 1. [#] 2. &amp 3. &gt 4. > 5. & 6. , 7. :: 8. / 9. : 10. - Being a n00b with PHP, I'm not sure if your more efficient code would be compatible with my purposes, and even if it were, I'm not sure how to implement it all. – Tony Aug 06 '11 at 22:31
  • @Tony those \s? at the beginning and the end of the pattern mean "one or zero spaces", it will match all the white space combinations you mentioned. – nobody Aug 06 '11 at 22:53
  • I've been racking my brain for a while now, attempting to learn from your code, and this is what I have. `preg_split('/\s?(?:&amp|&gt|\/|\?|\:\:|\:|\-|\>|\[#]|\,|\&|\::)\s?|\s?(?(?!blu$\-(?!ray)\-)\s?/', $result['categories'] ) ;` What I would like to do is ignore the dash in blu-ray, and this code does that. However, I would also like to ignore a space that might appear in blu ray, so that blu-ray and blu ray are treated the same. Similarly, t-shirt and t shirt should be treated the same. I'm having a problem integrating the whole scope together. Man, I love regex!! – Tony Aug 14 '11 at 20:49
  • @Tony Check out the explanation I just added to the answer. – nobody Aug 15 '11 at 09:14

Though I make no guarantee that my solution is more efficient than nobody's double lookaround pattern for this case, I think my solution is slightly easier to read. (*SKIP)(*FAIL) effectively matches and discards the substrings that you wish to ignore. In some cases, this approach can be very useful/effective/maintainable.

Code:

$string = 'I-like-candy-and-foo-bar-sandwiches';
var_export(preg_split('~foo-bar(*SKIP)(*FAIL)|-~', $string));

Output:

array (
  0 => 'I',
  1 => 'like',
  2 => 'candy',
  3 => 'and',
  4 => 'foo-bar',
  5 => 'sandwiches',
)
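The same idea can be extended to several protected terms, such as the blu-ray and t-shirt cases raised in the comments. The following is only a sketch: the `$protected` list and the sample string are hypothetical, and the terms are matched anywhere in the text (no word boundaries are enforced).

```php
<?php
// Hypothetical list of terms whose hyphen must survive the split.
$protected = ['foo-bar', 'blu-ray', 't-shirt'];

// Quote each term so that special characters are treated literally.
$alternatives = implode('|', array_map(
    function ($term) {
        return preg_quote($term, '~');
    },
    $protected
));

// Any protected term is matched and discarded; every other hyphen splits.
$pattern = '~(?:' . $alternatives . ')(*SKIP)(*FAIL)|-~';

$parts = preg_split($pattern, 'dvd-blu-ray-t-shirt-foo-bar-misc');
var_export($parts);
```

Adding another exception then only means appending one more string to the list instead of editing the regex by hand.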

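To be completely honest, I think nobody's answer is a bit over-engineered. For input like this, it can be written more simply with a negative lookbehind and a negative lookahead; the conditional syntax is not needed. Be aware that the simpler pattern is stricter than the conditional one: it also leaves a hyphen unsplit whenever it merely follows foo or merely precedes bar (for example in foo-baz). Writing the two lookarounds as alternatives, (?<!foo)-|-(?!bar), protects only the exact foo-bar pair.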

Code:

$string = 'I-like-candy-and-foo-bar-sandwiches';
var_export(preg_split('~(?<!foo)-(?!bar)~', $string));

Output:

array (
  0 => 'I',
  1 => 'like',
  2 => 'candy',
  3 => 'and',
  4 => 'foo-bar',
  5 => 'sandwiches',
)
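To see where the lookaround variants and the conditional pattern diverge, it may help to run them on an input that contains foo and bar next to other words. The sample string below is made up for this comparison.

```php
<?php
// Hypothetical input: "foo-baz" and "qux-bar" each contain only half of foo-bar.
$string = 'foo-baz-qux-bar-foo-bar';

// Both lookarounds must hold: a hyphen after "foo" OR before "bar" is kept.
$strict = preg_split('~(?<!foo)-(?!bar)~', $string);

// Either lookaround may hold: only the exact "foo-bar" hyphen is kept.
$loose = preg_split('~(?<!foo)-|-(?!bar)~', $string);

// The conditional pattern from the other answer, for comparison.
$conditional = preg_split('~(?(?<=foo)-(?!bar)|-)~', $string);

var_export($strict);
var_export($loose);
var_export($conditional);
```

Here the first pattern returns 'foo-baz', 'qux-bar', 'foo-bar', while the other two return 'foo', 'baz', 'qux', 'bar', 'foo-bar'.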

p.s. If you might have a hyphen at the start or end of your input string AND you don't want empty elements to be generated by preg_split(), then use 0 and PREG_SPLIT_NO_EMPTY as parameters 3 and 4 (respectively) in the function call.
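A small sketch of that call, using a made-up input with a hyphen at both ends:

```php
<?php
// Hypothetical input with a leading and a trailing hyphen.
$string = '-I-like-foo-bar-';

// Default call: the outer hyphens produce empty strings at both ends.
$withEmpties = preg_split('~foo-bar(*SKIP)(*FAIL)|-~', $string);

// A limit of 0 means "no limit"; the flag drops the empty pieces.
$withoutEmpties = preg_split('~foo-bar(*SKIP)(*FAIL)|-~', $string, 0, PREG_SPLIT_NO_EMPTY);

var_export($withEmpties);
var_export($withoutEmpties);
```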

mickmackusa