I have the following code:

//Array filled with data from external file
$patterns = array('!test!', 'stuff1', 'all!!', '');

//Delete empty values in array
$patterns = array_filter($patterns);

foreach($patterns as &$item){
       $item = preg_quote($item);
}

$pattern = '/(\b|^|- |--|-)(?:'.implode('|', $patterns).')(-|--| -|\b|$)/i';

$clid = "I am the !test! stuff1 all!! string";

echo $clid;
$clid = trim(preg_replace($pattern, ' ', $clid));
echo $clid;

Output:

//I am the !test! stuff1 all!! string
//I am the !test! all!! string

I'm escaping the ! with preg_quote, so why are !test! and all!! not removed?

I had a second problem, which is now solved, but I don't understand why it happened. Suppose $clid = "I am Jörg Müller with special chars". If I removed the line $patterns = array_filter($patterns); the output after preg_replace was I am J. I could not work out why, but adding array_filter fixed it.

Thank you

Perocat

2 Answers

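The problem is that you're using \b to assert word boundaries. \b only matches between a word character and a non-word character (or at the start or end of the string, next to a word character). "!" is not a word character, so \b doesn't match between " " and "!".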

These are the word boundaries in $clid:

 I   a m   t h e   ! t e s t !   s t u f f 1   a l l ! !   s t r i n g
^ ^ ^   ^ ^     ^   ^       ^   ^           ^ ^     ^     ^           ^
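
As a quick check of the diagram above, this sketch lists the offsets at which \b matches in the question's string. It relies only on preg_match_all with PREG_OFFSET_CAPTURE; the variable names are illustrative.

```php
<?php
// The string from the question.
$clid = "I am the !test! stuff1 all!! string";

// \b is zero-width, so every match is an empty string; only the offsets matter.
preg_match_all('/\b/', $clid, $matches, PREG_OFFSET_CAPTURE);
$offsets = array_column($matches[0], 1);

echo implode(' ', $offsets), "\n";
// 0 1 2 4 5 8 10 14 16 22 23 26 29 35

// Offset 9 is the position between " " and the first "!": both are
// non-word characters, so \b does not match there.
var_dump(in_array(9, $offsets, true)); // bool(false)
```

Offset 10 (between "!" and "t") is a boundary, but offset 9 is not, which is why an item that starts with "!" can never be preceded by \b after a space.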

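Instead, you could require explicit delimiters around each item, using a lookahead for the trailing space:

  1. (?:-[- ]?| +) matches "- ", "--", "-", or one or more spaces before the item.
  2. (?:-[- ]?|(?= )|$) matches "- ", "--", or "-" after the item, or asserts that it is followed by a space or the end of the string.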

Regex

$pattern = '/(?:-[- ]?| +)(?:'.implode('|', $patterns).')(?:-[- ]?|(?= )|$)/i';

Code

//Array filled with data from external file
$patterns = array('!test!', 'stuff1', 'all!!', '');

//Delete empty values in array
$patterns = array_filter($patterns);

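foreach($patterns as &$item){
       //Escape regex metacharacters, including the / delimiter
       $item = preg_quote($item, '/');
}
unset($item); //Break the reference left over from the loop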

$pattern = '/(?:-[- ]?| +)(?:'.implode('|', $patterns).')(?:-[- ]?|(?= )|$)/i';


$clid = "I am the !test! stuff1 all!! string and !test!! not matched";
$clid = trim(preg_replace($pattern, '', $clid));

echo $clid;

Output

I am the string and !test!! not matched

ideone demo



As for your second question, you have an empty item in your array, so the regex would turn out to be:

(?:option1|option2|option3|)
                           ^

Notice the 4th option there: an empty subpattern, which always matches. Because the whole alternation can match nothing, your regex effectively behaves like:

/(\b|^|- |--|-)(-|--| -|\b|$)/i

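That pattern can match the empty string at every word boundary, and each such match is replaced with a space, which is why you got unexpected results.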

array_filter() solved your problem by removing empty items.
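
A minimal sketch of the difference, using hypothetical items foo and bar rather than the data from the question. It compares an alternation built with and without the empty item against a subject that contains neither item.

```php
<?php
$items = array('foo', 'bar', '');

// With the empty item, implode() leaves a trailing "|", i.e. an empty alternative.
$withEmpty = '/(?:' . implode('|', array_map('preg_quote', $items)) . ')/';

// array_filter() drops the empty string before the alternation is built.
$filtered = '/(?:' . implode('|', array_map('preg_quote', array_filter($items))) . ')/';

echo $withEmpty, "\n"; // /(?:foo|bar|)/
echo $filtered, "\n";  // /(?:foo|bar)/

$subject = 'zzz'; // contains neither "foo" nor "bar"

var_dump(preg_match($withEmpty, $subject, $m)); // int(1): the empty alternative matches
var_dump($m[0]);                                // string(0) ""
var_dump(preg_match($filtered, $subject));      // int(0): nothing to match
```

Note that array_filter() without a callback also drops the string "0", so an explicit check for the empty string may be safer if "0" can be a legitimate item.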

Mariano
  • Thank you! But why did the empty subpattern always act on special characters? – Perocat Nov 11 '15 at 00:15
  • So complete new pattern `(?<!\w)(- |--|-)?(?:' . implode('|', $patterns) . ')(-|--| -)?(?!\w)`. This way I will also match strings like `- TEST -` when having only `TEST` in $patterns, right? – Perocat Nov 11 '15 at 00:16
  • with `TEST` inside `$patterns` I want to match `- TEST -, --TEST--, -TEST-, - TEST--, - TEST-, --TEST -, --TEST-, -TEST--, -TEST -, TEST`. The `?` after second and fourth groups says zero or one time, right? So, it should work as I wrote, am I wrong? – Perocat Nov 11 '15 at 00:21
  • `- ?` means `-` OR `- ` (with or without a space after it)? What exactly is the difference between starting a group with `?:` and without it? – Perocat Nov 11 '15 at 00:30
  • Yes `- ?` is a dash followed by an optional space. The difference is it doesn't use memory to capture the text matched by that group. You can read about non-capturing groups in http://www.regular-expressions.info/brackets.html – Mariano Nov 11 '15 at 00:33
  • Suppose `$patterns` contains `!stuff`; with this regex `!stuff!` will be evaluated, but it shouldn't be. OK for `-< >?`, but I can't use this at the end, right? Because I am expecting `-` or `< >-` – Perocat Nov 11 '15 at 00:33
  • I edited the answer to match the dashes, or to assert for spaces. If you find exceptions where it should/shouldn't match, please edit your question to clarify it further with some examples. However, I'm sure this would give you an idea of what you need to do to come up with the answer. – Mariano Nov 11 '15 at 01:01

Here is how I would do it:

$clid = "I am the !test! stuff1 all!! string";

$items = ['!test!', 'stuff1', 'all!!', ''];

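// Append each non-empty item, escaped for the ~ delimiter, followed by |
// (comparing against '' rather than using empty() keeps an item like "0")
$pattern = array_reduce($items, function ($c, $i) {
    return $i === '' ? $c : $c . preg_quote($i, '~') . '|';
}, '~[- ]+(?:');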

$pattern .= '(*F))(?=[- ])~u';

$result = preg_replace($pattern, '', ' ' . $clid . ' ');
$result = trim($result, "- \t\n\r\0\x0b");

demo

The idea is to check for a space or a hyphen after the "word" with a lookahead. This way the "separator" is not consumed, and the pattern can handle consecutive matches.

To avoid an alternation at the beginning of the pattern (like (?:[- ]|^)[- ]*, which is slow), I add a space at each end of the source string; trim removes whatever is left of them after the replacement.

The (*F) verb (which forces that branch to fail) is only there because the alternation of items is built with array_reduce, which leaves a trailing | at the end.
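
To illustrate why the trailing | needs closing off, here is a small sketch with hypothetical items foo and bar. It builds the alternation in the same spirit as the array_reduce call and compares closing it with (*F) against leaving the last branch empty.

```php
<?php
$items = ['foo', 'bar', ''];

// Every non-empty item is followed by "|", so the result ends with "|".
$alternation = array_reduce($items, function ($c, $i) {
    return $i === '' ? $c : $c . preg_quote($i, '~') . '|';
}, '');

echo $alternation, "\n"; // foo|bar|

$closedWithFail  = '~(?:' . $alternation . '(*F))~'; // ~(?:foo|bar|(*F))~
$closedWithEmpty = '~(?:' . $alternation . ')~';     // ~(?:foo|bar|)~

var_dump(preg_match($closedWithFail, 'zzz'));   // int(0): the last branch always fails
var_dump(preg_match($closedWithEmpty, 'zzz'));  // int(1): the empty branch matches
var_dump(preg_match($closedWithFail, 'a bar')); // int(1): real items still match
```

An alternative would be to filter the items first and join them with implode('|', ...), which avoids the trailing | altogether.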

The problem with characters outside the ASCII range is solved with the u modifier. With this modifier the regex engine treats the pattern and the subject as UTF-8 encoded strings.
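
A short sketch of what the u modifier changes, assuming PHP 7 or later for the "\u{...}" escape. Without u the engine works byte by byte; with u it works on whole UTF-8 characters.

```php
<?php
// "\u{F6}" is ö; the escape keeps the example independent of the file's encoding.
$name = "J\u{F6}rg";

var_dump(strlen($name));                 // int(5): ö takes two bytes in UTF-8
var_dump(preg_match_all('/./', $name));  // int(5): one match per byte
var_dump(preg_match_all('/./u', $name)); // int(4): one match per character
```

In byte mode a zero-width match can fall between the two bytes of ö, and inserting a replacement there produces invalid UTF-8, which is one plausible way output can get cut off after "J".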

Casimir et Hippolyte