
I have the following code, where I am trying to match specific words exactly using word boundaries, replace them with "censored", and then rebuild the text. For some reason the regex is also catching a trailing space. I've simplified it down to the following test case for clarity:

<?php

$words = array('bad' => "censored");
$text = "bad bading testbadtest badder";
$newtext = "";

foreach( preg_split( "/(\[\/?(?:acronym|background|\*)(?:=.+?)?\]|(^|\W)bad(\W|$))/i", $text, null, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY ) as $section )
{
    if ( isset( $words[ $section ] )  )
    {
        $newtext .= $words[ $section ];
    }
    else
    {
        $newtext .= $section ;
    }
}

var_dump($newtext);

exit;

In this example I am expecting to match on "bad" but not on "bading", "testbadtest" or "badder". The issue is that "bad " (note the trailing space) is being matched, and that string does not exist as a key in the $words array.

Could somebody please explain where I may be going wrong?

Thanks in advance

  • `bad(\W|$)` means `bad` followed by any non-word character (or the end of the string); here that character is a space, so it becomes part of the match. What you need is an assertion such as `bad(?=\W)`, or a word boundary: `bad\b`. http://us2.php.net/manual/en/regexp.reference.assertions.php – zerkms Oct 25 '13 at 23:24
  • Why are you using `preg_split` for this? – Steven Oct 25 '13 at 23:41
  • Also, you have got a space proceeding `bad` in your `$words` array? If the issue is a space.. Have you thought about using [trim](http://php.net/trim) before trying to match? – Steven Oct 25 '13 at 23:45
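
To make the difference between a consuming group and a zero-width assertion concrete, here is a minimal sketch on a shortened test string. The variable names are illustrative only and do not come from the question; the lookahead is extended with `|$` so that it also succeeds at the end of the string.

```php
<?php
$text = 'bad bading';

// Consuming group: the non-word character (the space) becomes part of the match.
preg_match('/bad(\W|$)/', $text, $consuming);

// Zero-width alternatives: the following character is checked but not consumed.
preg_match('/bad(?=\W|$)/', $text, $lookahead);
preg_match('/bad\b/', $text, $boundary);

var_dump($consuming[0]); // string(4) "bad "
var_dump($lookahead[0]); // string(3) "bad"
var_dump($boundary[0]);  // string(3) "bad"
```

With the consuming form, the delimiter that `preg_split` hands back is "bad " rather than "bad", which is why the `isset()` lookup in the question fails.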

1 Answer


I think I would take a different approach, as I am not sure why you are using preg_split() and hard-coding your censored words in the regex.

Simply build an array of patterns you want to replace and their replacements and use preg_replace().

// note no space in words or their replacements
$word_replacement_map = array(
    'bad' => 'b*d',
    'alsobad' => 'a*****d'
);
$bad_words = array_keys($word_replacement_map);
$patterns = array_map(function($item) {
    // pass the delimiter so any "/" inside a word is escaped too
    return '/\b' . preg_quote($item, '/') . '\b/u';
}, $bad_words);
$replacements = array_values($word_replacement_map);
$input_string = 'the string with bad and alsobad words';
$cleaned_string = preg_replace($patterns, $replacements, $input_string);
var_dump($cleaned_string); // the string with b*d and a*****d words

Note that if you don't need word-specific replacements, you could simplify this down to:

// note no space in words
$bad_words = array(
    'bad',
    'alsobad'
);
$replacement = 'censored';
$patterns = array_map(function($item) {
    // pass the delimiter so any "/" inside a word is escaped too
    return '/\b' . preg_quote($item, '/') . '\b/u';
}, $bad_words);
$input_string = 'the string with bad and alsobad words';
$cleaned_string = preg_replace($patterns, $replacement, $input_string);
var_dump($cleaned_string); // the string with censored and censored words

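Note that I am using word boundaries (`\b`) in the regex patterns here, so `bad` is matched only as a whole word and not inside words such as `bading` or `badder`. That should generally meet your needs.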
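
As a quick check against the test string from the question, the same idea can be wrapped in a small helper. This is only a sketch: `censor_words` is a hypothetical name, and the `i` flag is an assumption carried over from the case-insensitive pattern in the question.

```php
<?php
// Hypothetical helper (not part of the original answer): replaces whole-word
// occurrences of each key in $map with the corresponding value.
function censor_words($text, array $map)
{
    $patterns = array_map(function ($word) {
        // Escape the word (including the "/" delimiter) and anchor it with \b
        // so it only matches as a whole word; "i" makes it case-insensitive.
        return '/\b' . preg_quote($word, '/') . '\b/iu';
    }, array_keys($map));

    return preg_replace($patterns, array_values($map), $text);
}

$words = array('bad' => 'censored');
$result = censor_words('bad bading testbadtest badder', $words);

echo $result, PHP_EOL; // censored bading testbadtest badder
```

Only the standalone "bad" is replaced; the space after it is left in place because `\b` does not consume any characters.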

Mike Brant