I'm working on a WordPress plugin that replaces the bad words from the comments with random new ones from a list.

I now have 2 arrays: one containing the bad words and another containing the good words.

$bad = array("bad", "words", "here");
$good = array("good", "words", "here");

Since I'm a beginner, I got stuck at some point.

In order to replace the bad words, I've been using $newstring = str_replace($bad, $good, $string);.

My first problem is that I want the match to be case-insensitive, so I don't have to list every variant ("bad", "Bad", "BAD", "bAd", "BAd", etc.), but I need the new word to keep the format of the original word. For example, if I write "Bad", it should be replaced with "Words", but if I type "bad", it should be replaced with "words".

My first thought was to use str_ireplace, but it forgets whether the original word had a capital letter.

The second problem is that I don't know how to deal with the users that type like this: "b a d", "w o r d s", etc. I need an idea.

In order to make it select a random word, I think I can use $new = $good[rand(0, count($good)-1)]; then $newstring = str_replace($bad, $new, $string);. If you have a better idea, I'm here to listen.

The general look of my script:

function noswear($string)
{
    if ($string)
    {
        $bad = array("bad", "words");
        $good = array("good", "words");
        $newstring = str_replace($bad, $good, $string);
        return $newstring;
    }
}

echo noswear("I see bad words coming!");

Thank you in advance for your help!

Ilia Ross
Rawrrr1337

2 Answers

11

Precursor

There are (as has been pointed out in the comments numerous times) gaping holes for you - and/or your code - to fall into through implementing such a feature, to name but a few:

  1. People will add characters to fool the filter
  2. People will become creative (e.g. innuendo)
  3. People will use passive aggression and sarcasm
  4. People will use sentences/phrases not just words

You'd do better to implement a moderation/flagging system where people can flag offensive comments which can then be edited/removed by mods, users, etc.

On that understanding, let us proceed...

Solution

Given that you:

  1. Have a forbidden word list $bad_words
  2. Have a replacement word list $good_words
  3. Want to replace bad words regardless of case
  4. Want to replace bad words with random good words
  5. Have a correctly escaped bad word list: see http://php.net/preg_quote

You can very easily use PHP's preg_replace_callback function:

$input_string = 'This Could be interesting but should it be? Perhaps this \'would\' work; or couldn\'t it?';

$bad_words  = array('could', 'would', 'should');
$good_words = array('might', 'will');

function replace_words($matches){
    global $good_words;
    return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
}

echo preg_replace_callback('/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i', 'replace_words', $input_string);

Okay, so the pattern passed to preg_replace_callback is built from all of the bad words. Matches will then be in the format:

/(START OR WORD_BOUNDARY OR WHITE_SPACE)(BAD_WORD)(WORD_BOUNDARY OR WHITE_SPACE OR END)/i

The i modifier makes it case insensitive so both bad and Bad would match.

The function replace_words then takes the matched word and its boundaries (either blank or a white space character) and replaces it with the boundaries and a random good word.

global $good_words; <-- Makes the $good_words variable accessible from within the function
$matches[1] <-- The word boundary before the matched word
$matches[3] <-- The word boundary after  the matched word
$good_words[rand(0, count($good_words)-1)] <-- Selects a random good word from $good_words
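
To see what actually lands in each capture group, here is a small sketch that prints the groups for every match. The sample sentence is made up for illustration, and preg_match_all is used here only to inspect the groups, not to replace anything:

```php
<?php
// Hypothetical sample data, for illustration only.
$bad_words = array('could', 'would', 'should');
$pattern   = '/(^|\b|\s)(' . implode('|', $bad_words) . ')(\b|\s|$)/i';

// Collect every match together with its three capture groups.
preg_match_all($pattern, "This Could work; it 'would' too", $found, PREG_SET_ORDER);

foreach ($found as $m) {
    // $m[1] = leading boundary, $m[2] = matched bad word, $m[3] = trailing boundary
    printf("[%s][%s][%s]\n", $m[1], $m[2], $m[3]);
}
```

Note that a boundary group holds either an empty string (when \b, ^ or $ matched) or the whitespace character itself, which is why the callback has to put $matches[1] and $matches[3] back around the replacement.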

Anonymous function

You could rewrite the above as a single statement by passing an anonymous function to preg_replace_callback:

echo preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );

Function wrapper

If you're going to use it multiple times you may also write it as a self-contained function. In that case you will most likely want to pass the good/bad word lists in to the function when calling it (or hard-code them in there permanently), but that depends on how you derive them...

function clean_string($input_string, $bad_words, $good_words){
    return preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );
}

echo clean_string($input_string, $bad_words, $good_words);

Output

Running the above functions consecutively with the input and word lists shown in the first example:

This will be interesting but might it be? Perhaps this 'will' work; or couldn't it?
This might be interesting but might it be? Perhaps this 'might' work; or couldn't it?
This might be interesting but will it be? Perhaps this 'will' work; or couldn't it?

Of course the replacement words are chosen randomly so if I refreshed the page I'd get something else... But this shows what does/doesn't get replaced.
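
The question also asks for the replacement to keep the capitalisation of the original word, which the examples above don't do. A minimal sketch of one way to handle it, assuming only three shapes matter (all caps, leading capital, everything else lowercased). The names match_case and clean_string_keep_case are hypothetical and not part of PHP or of the code above:

```php
<?php
// Hypothetical helper: copy the capitalisation shape of $source onto $replacement.
function match_case($source, $replacement)
{
    if ($source === strtoupper($source) && $source !== strtolower($source)) {
        return strtoupper($replacement);          // all caps, e.g. "BAD"
    }
    if (ucfirst(strtolower($source)) === $source) {
        return ucfirst(strtolower($replacement)); // leading capital, e.g. "Bad"
    }
    return strtolower($replacement);              // anything else, e.g. "bad" or "bAd"
}

// Hypothetical variant of clean_string() that applies match_case() to the random pick.
function clean_string_keep_case($input_string, array $bad_words, array $good_words)
{
    // Escape regex metacharacters and the "/" delimiter in each bad word.
    $quoted  = array_map(function ($w) { return preg_quote($w, '/'); }, $bad_words);
    $pattern = '/(^|\b|\s)(' . implode('|', $quoted) . ')(\b|\s|$)/i';

    return preg_replace_callback(
        $pattern,
        function ($matches) use ($good_words) {
            $pick = $good_words[array_rand($good_words)];
            return $matches[1] . match_case($matches[2], $pick) . $matches[3];
        },
        $input_string
    );
}

echo clean_string_keep_case('Bad, bad, BAD!', array('bad'), array('good')), "\n";
```

Mixed-case input such as "bAd" simply falls back to lowercase in this sketch; mapping case letter by letter is possible but gets awkward when the two words differ in length.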

N.B.

Escaping $bad_words

foreach($bad_words as $key=>$word){
    // Pass the delimiter too, so a "/" inside a word can't end the pattern early
    $bad_words[$key] = preg_quote($word, '/');
}
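
As a quick illustration of what the escaping does, here is a sketch with a made-up word list. The patterns in this answer use / as the delimiter, and preg_quote only escapes the delimiter when it is passed as the second argument:

```php
<?php
// Hypothetical word list containing regex metacharacters and the delimiter.
$bad_words = array('$h1t', 'a/b', 'bad');

// Escape regex metacharacters and the "/" delimiter in each word.
$quoted = array_map(function ($word) {
    return preg_quote($word, '/');
}, $bad_words);

print_r($quoted);
```

The plain word 'bad' comes back unchanged, while '$h1t' and 'a/b' gain a backslash in front of the $ and the / respectively.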

Word boundaries \b

In this code I've used \b, \s, and ^ or $ as word boundaries, and there is a good reason for this. While white space, start of string, and end of string are all considered word boundaries, \b will not match in all cases, for example:

\b\$h1t\b <---Will not match

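This is because \b only matches at a position between a word character (i.e. [a-zA-Z0-9_]) and a non-word character, and characters like $ don't count as word characters. There is therefore no boundary between, say, a space and the $ that follows it.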
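
A short sketch comparing the two patterns on a made-up sentence shows the difference:

```php
<?php
// Hypothetical input, for illustration only.
$text = 'what a $h1t day';

// \b needs a word character on one side; there is none between " " and "$".
$plain = preg_match('/\b\$h1t\b/i', $text);

// The widened boundary groups also accept whitespace, start, or end of string.
$widened = preg_match('/(^|\b|\s)(\$h1t)(\b|\s|$)/i', $text);

var_dump($plain, $widened);
```

Here the first pattern finds nothing, while the second matches because the \s alternative picks up the space in front of the $.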

Misc

Depending on the size of your word list there are a couple of potential hiccups. From a system design perspective it's generally bad form to have huge regexes for a couple of reasons:

  1. It can be difficult to maintain
  2. It's difficult to read/understand what it does
  3. It's difficult to find errors
  4. It can be memory intensive if the list is too large

Given that the regex pattern is built by PHP from the word list, the first reason is negated. The second should be negated as well; if your word list is large, with a dozen permutations of each bad word, then I suggest you stop and rethink your approach (read: use a flagging/moderation system).

To clarify, I don't see a problem with having a small word list to filter out specific expletives, as it serves a purpose: to stop users from having an outburst at one another. The problem comes when you try to filter out too much, including permutations. Stick to filtering common swear words and if that doesn't work then - for the last time - implement a flagging/moderation system.

Steven
  • I would also advise heavily against globals: you don't need it for your closure (`function($matches) use($bad_words,$good_words){}` works fine), and one is far better off with predictable, reliable and testable code if the function gets those lists as _arguments_. – Wrikken Oct 14 '13 at 22:57
  • I must be reading this _out of context_ "I would also"; Is there a previous comment which I've missed?? – Steven Oct 14 '13 at 23:51
  • As for the use of `global`: while I agree that you could use `use` with the anonymous function you couldn't use it in the way that you're suggesting for a couple of reasons one being purely because `$bad_words` is not passed in. However, `function ($matches) use($good_words){}` _could_ be used in the **Anonymous Function** example above. But in the function `clean_string` to do the same you would have to reference `$good_words` in the parent/named function (as `use` only takes from the parent). – Steven Oct 15 '13 at 00:01
  • In this case it's simpler and more readable to simply use the `global` keyword. Otherwise you would have to pass in word list arguments every time you called the function (e.g. `clean_string($input_string, $bad_words, $good_words)`) which might not be a huge pain but certainly isn't necessary, having said that I did mention that feeding the word lists as arguments is something that (would be preferred) if it was to be turned into a function. – Steven Oct 15 '13 at 00:09
  • I also suggested that the OP may instead initialise them in the function (as I assume the lists will come from files or database tables) in which case neither `global` nor `use` would be used. – Steven Oct 15 '13 at 00:12
  • yes, the _also_ was about a deleted portion where I was dead-wrong (the `\b` part). But: using globals, _especially_ when talking to developing coders (which I'm sure you know are all too copy/paste happy), _even as an example_ is IMHO not a good idea. Making them arguments, feeding them to a closure, making it a tiny class with properties and a method, even if we insist on wanting a global state static variables & a static method in a class, all equally easy to incorporate in the examples. And it prevents the allure of just using it _as is_, with the globals. – Wrikken Oct 15 '13 at 00:16
  • @Wrikken I concede your point regarding copy/paste. In all fairness, I copy/pasted the original `replace_words` function into the later two examples and just didn't update it... Anyway, given that it was a 10 second fix (essentially copy/pasting a few lines in different places) I have updated the code to _not use_ the `global` key word and instead require the word lists to be passed in as arguments. – Steven Oct 15 '13 at 07:38
  • Excellent, my thanks, and possibly the thanks of any future colleagues of new coders reading it ;) – Wrikken Oct 15 '13 at 07:40
5

I came up with this method and it's working fine. It returns true if the input is one of the bad words.

Example:

function badWordsFilter($inputWord) {
  $badWords = array("bad", "words", "here");
  for ($i = 0; $i < count($badWords); $i++) {
     if ($badWords[$i] == strtolower($inputWord)) {
        return true;
     }
  }
  return false;
}

Usage:

if (badWordsFilter("bad")) {
    echo "Bad word was found";
} else {
    echo "No bad words detected";
}

As the word 'bad' is blacklisted, it will echo "Bad word was found".

Online example 1

EDIT 1:

As suggested by rid, it's also possible to do a simple in_array check:

function badWordsFilter($inputWord) {
  $badWords = array("bad", "words", "here");
  // in_array already returns true/false, so return its result directly
  return in_array(strtolower($inputWord), $badWords);
}
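
Both versions above test a single word. As the comments point out, a comment usually contains several words, so here is a minimal sketch that splits the text first. containsBadWord is a hypothetical name, and the blacklist is assumed to be all lowercase:

```php
<?php
// Hypothetical variant that scans a whole comment rather than a single word.
function containsBadWord($comment, array $badWords)
{
    // Split on anything that is not a letter, digit, or underscore.
    $words = preg_split('/\W+/', strtolower($comment), -1, PREG_SPLIT_NO_EMPTY);

    // array_intersect keeps the words that also appear in the blacklist.
    return count(array_intersect($words, $badWords)) > 0;
}

var_dump(containsBadWord('Some BAD words, here!', array('bad', 'words', 'here')));
var_dump(containsBadWord('A badge is fine', array('bad')));
```

Because the text is split into whole words, "badge" does not trigger the filter. The trade-off is that splitting on \W+ breaks up entries that contain symbols (such as `$h1t`), so those would need a regex-based check instead.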

Online example 2

EDIT 2:

As I promised, I came up with a slightly different idea for replacing bad words with good words, as you mentioned in your question. I hope it helps a bit, but this is the best I can offer at the moment, as I'm not entirely sure what you're trying to do.

Example:

1. Let's combine an array with bad and good words into one

$wordsTransform = array(
  'shit' => 'ship'
);

2. Your imaginary user input

$string = "Rolling In The Deep by Adel\n
\n
There's a fire starting in my heart\n
Reaching a fever pitch, and it's bringing me out the dark\n
Finally I can see you crystal clear\n
Go ahead and sell me out and I'll lay your shit bare";

3. Replacing bad words with good words

$string = strtr($string, $wordsTransform);

4. Getting the desired output

Rolling In The Deep by Adel

There's a fire starting in my heart
Reaching a fever pitch, and it's bringing me out the dark
Finally I can see you crystal clear
Go ahead and sell me out and I'll lay your ship bare

Online example 3

EDIT 3:

Following the correct comment from Wrikken: I had totally forgotten that strtr is case-sensitive and that it's better to respect word boundaries. I have borrowed the following example from PHP: strtr - Manual and modified it slightly.

Same idea as in my second edit, but case-insensitive: it checks for word boundaries and puts a backslash in front of every character that is part of the regular expression syntax:

1. Method:

//
// Written by Patrick Rauchfuss
// Note: "String" cannot be used as a class name since PHP 7, hence StringUtil here.
class StringUtil
{
    public static function stritr(&$string, $from, $to = '')
    {
        if (is_string($from)) {
            // Escape regex metacharacters (and the "/" delimiter) in the search word
            $string = preg_replace('/\b' . preg_quote($from, '/') . '\b/i', $to, $string);
        } elseif (is_array($from)) {
            foreach ($from as $key => $val) {
                self::stritr($string, $key, $val);
            }
        }
        return $string; // $string is also modified in place via the reference
    }
}

2. An array with bad and good words

$wordsTransform = array(
    'shit' => 'ship'
);

3. Replacement

StringUtil::stritr($string, $wordsTransform);

Online example 4

Ilia Ross
  • try `strtolower` on the `$inputWord` so that it gets around the issue of having 'bAd' getting through... he also wants this to strip bad words from a comment not singular word, so you would have to explode the `$inputWord` into an array and then check each value – rorypicko Oct 14 '13 at 11:32
  • @rid it could be replaced with `in_array()` although this then removes the ease of extending it with regular expression checks for example – rorypicko Oct 14 '13 at 11:35
  • Thank you, but I think it's not what I actually need. My script needs to replace the bad words with another random word from an array, but your script only tells me if there is a bad word or not. – Rawrrr1337 Oct 14 '13 at 11:48
  • Understood! What would you like your words to be replaced with? Why not just replace them with `...`? – Ilia Ross Oct 14 '13 at 11:50
  • I need to replace the bad words with another random word from an array, since I'll have a database with the good and bad words and the arrays may not have equal number of elements, so for each bad word I will do something like this: `$new = $good[rand(0, count($good)-1)];` to get a random value, then I will replace the bad word with `$new`. – Rawrrr1337 Oct 14 '13 at 11:56
  • I will be going to be away soon! I will come back to you in the evening and try to play with your problem in case it wouldn't be solved by then! – Ilia Ross Oct 14 '13 at 11:59
  • For the love of... at _the very least_ check for word boundaries when you must do this ([there's a reason most comments advise against it](http://thedailywtf.com/Articles/The-Clbuttic-Mistake-.aspx)). Something like `/\b$word\b/i` in a regex. – Wrikken Oct 14 '13 at 17:54
  • @Wrikken I updated my answer to check for word boundaries! Sorry! You're right! Now it would catch it anyway! – Ilia Ross Oct 14 '13 at 18:30
  • That is _still_ no **word-boundary**. – Wrikken Oct 14 '13 at 19:32
  • @Wrikken I understood and finally fixed it! Thank you for pointing out useful point! – Ilia Ross Oct 14 '13 at 19:47
  • And, depending on what the words contain (the often used `$h1t` for instance), use `preg_quote`... – Wrikken Oct 14 '13 at 19:49