Not very technical, but... I have to implement a bad-words filter in a new site we are developing, so I need a "good" bad-words list to feed my DB with. Any hints or directions? Looking around with Google I found this one, and it's a start, but nothing more.

Yes, I know that this kind of filter is easily bypassed... but the client's will is the client's will! :-)

The site will have to filter out both English and Italian words, but for Italian I can ask my colleagues to help me with a community-built list of "parolacce" :-) - an email will do.

Thanks for any help.

Keng
ila
  • Obscenity filtering... a bad idea or a really intercoursing bad idea? – stephenbayer Oct 22 '08 at 13:18
  • Team it up with a spellchecker: if you get more spelling errors post-censorship, you've messed up somewhere and can deal with it. – nailitdown Sep 02 '10 at 04:28
  • related: http://programmers.stackexchange.com/questions/143405/how-to-generate-language-safe-uuids – David Cary Jun 28 '12 at 03:21
  • Very few filters can detect the words "Shiτ" and "fucκ", though. Not even StackOverflow. – Theodore R. Smith Aug 02 '13 at 20:25
  • To everyone saying that this is pointless and/or stupid, consider that this kind of filtering could still be useful as one part of a larger system. Yes, it's probably a bad idea to find/replace or automatically reject based purely on a blacklist, but a filter could be used, for example, to send user-submitted content for manual approval/moderation. Or perhaps it could be used to warn a user before submission that they may be banned if they post offensive material. – Cam Jackson Aug 09 '13 at 08:28
  • This is great for web-based educational software to flag student responses that "contain profanity", which can then be relayed to the teachers for review. I created an ASCII folding map, in which I hand-mapped all 65,000+ Unicode code points to their closest visual ASCII equivalent if one exists. I then did the same for all permutations of 2, 3, and 4-character sequences using a visual similarity engine, to collapse them to their nearest single-character equivalent (e.g. "\/\/" = "W", "|-|" = "H", "|_" = "L"), and then used a hierarchical temporal memory algorithm to recognize them instantly. – Triynko Feb 10 '15 at 20:14
  • After much munging and collection: https://github.com/alvations/expletives/tree/master – alvas Apr 16 '17 at 06:50
  • Hi @triynko If you are willing to share code I would be interested. Interesting idea. – Robert Lugg Aug 31 '19 at 17:31
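
The comments above point in two directions: folding look-alike characters (such as the Greek letters in "Shiτ") back to plain ASCII, and flagging matches for a human instead of rewriting them. A minimal Python sketch of both, assuming a tiny hand-written look-alike table and the hypothetical helper names `fold_to_ascii` and `needs_review` (this is not Triynko's map or code, and a usable table would be far larger):

```python
import unicodedata

# Tiny illustrative sample of look-alike characters (an assumption, not a real map).
CONFUSABLES = {
    "τ": "t",  # Greek small letter tau
    "κ": "k",  # Greek small letter kappa
    "@": "a",
    "$": "s",
}

def fold_to_ascii(text):
    """Lower-case, strip accents, and map a few look-alikes to ASCII letters."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop accent marks left over from decomposition
        out.append(CONFUSABLES.get(ch, ch))
    return "".join(out)

def needs_review(text, blocklist):
    """Return True if any folded word is on the blocklist (flag, don't rewrite)."""
    words = fold_to_ascii(text).split()
    return any(word.strip(".,!?;:\"'()") in blocklist for word in words)

# "darn" is a harmless stand-in for a real blocklist entry.
print(needs_review("Well, d@rn!", {"darn"}))    # True
print(needs_review("darning socks", {"darn"}))  # False
```

Returning a flag rather than a censored string matches the moderation-queue use suggested above: a false positive then costs a moderator a few seconds instead of mangling a legitimate post.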

9 Answers

Beware of clbuttic mistakes.

"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"

Hmm. "clbuttic".

Google "clbuttic" - thousands of hits!

There's someone who calls his car 'clbuttic'.

There are "Clbuttic Steam Engine" message boards.

Webster's dictionary - no help.

Hmm. What can this be?

HINT: People who make buttumptions about their regex scripts, will be embarbutted when they repeat this mbuttive mistake.
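
The mistake is easy to reproduce. A minimal Python sketch, assuming a one-entry replacement table (the table and the function name `naive_censor` are made up for illustration): a plain substring replace has no notion of word edges, so it rewrites the inside of innocent words.

```python
# Hypothetical one-entry replacement table, chosen to reproduce the mistake.
REPLACEMENTS = {"ass": "butt"}

def naive_censor(text):
    """Replace every occurrence of a listed string, even inside other words."""
    for bad, clean in REPLACEMENTS.items():
        text = text.replace(bad, clean)
    return text

print(naive_censor("a classic mistake"))  # a clbuttic mistake
print(naive_censor("assumptions"))        # buttumptions
print(naive_censor("massive"))            # mbuttive
```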

callisto
AgentConundrum

I didn't see any language specified, but you can use this for PHP. It generates a regex for each inserted word so that even intentional misspellings (e.g. @ss, i3itch) will also be caught.

<?php

/**
 * @author unkwntech@unkwndesign.com
 **/

if(isset($_GET['act']) && $_GET['act'] == 'do')
 {
    // Each letter maps to a regex fragment matching its common look-alikes.
    // Character classes hold single characters only (a space inside a class
    // would match a literal space); multi-character look-alikes such as
    // "ph" or "vv" use alternation instead.
    $replace = array(
        'a' => '[aA@]',
        'b' => '(?:[bB]|[Ili]3)',
        'c' => '[cC(kK]',
        'd' => '[dD]',
        'e' => '[eE3]',
        'f' => '(?:[fF]|[pP][hH])',
        'g' => '[gG6]',
        'h' => '[hH]',
        'i' => '[iIl!1]',
        'j' => '[jJ]',
        'k' => '[cC(kK]',
        'l' => '[lL1!i]',
        'm' => '[mM]',
        'n' => '[nN]',
        'o' => '[oO0]',
        'p' => '[pP]',
        'q' => '[qQ9]',
        'r' => '[rR]',
        's' => '[sS$5]',
        't' => '[tT7]',
        'u' => '[uUvV]',
        'v' => '[vVuU]',
        'w' => '(?:[wW]|[vV]{2})',
        'x' => '[xX]',
        'y' => '[yY]',
        'z' => '[zZ2]',
    );
    $word = isset($_POST['word']) ? str_split(strtolower($_POST['word'])) : array();
    foreach($word as $i => $char)
     {
        // Letters become their fragment; anything else is matched literally.
        $word[$i] = isset($replace[$char]) ? $replace[$char] : preg_quote($char, '/');
     }
    //$word = "/" . implode('', $word) . "/";
    echo htmlspecialchars(implode('', $word));
 }

if(isset($_GET['act']) && $_GET['act'] == 'list')
 {
    // The mysql_* functions were removed in PHP 7; mysqli is used instead.
    $link = mysqli_connect('localhost', 'username', 'password', 'peoples');
    $sql = "SELECT word FROM filters";
    $result = mysqli_query($link, $sql);
    while($row = mysqli_fetch_assoc($result))
     {
        echo htmlspecialchars($row['word']) . "<br />";
     }
     echo '<hr>';
 }
?>
<html>
    <head>
        <title>RegEx Generator</title>
    </head>
    <body>
        <form action='badword.php?act=do' method='post'>
            Word: <input type='text' name='word' /><br />
            <input type='submit' value='Generate' />
        </form>
        <a href="badword.php?act=list">List Words</a>
    </body>
</html>
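
The script above only prints the generated pattern; it does not show the pattern being applied. A minimal Python sketch of that last step, assuming a much smaller substitution table and a hypothetical helper `build_pattern` (the table is illustrative, not a port of the PHP one):

```python
import re

# Hypothetical substitution table (a small subset, for illustration only).
SUBSTITUTES = {
    "a": "[a@]",
    "e": "[e3]",
    "i": "[i1!]",
    "o": "[o0]",
    "s": "[s$5]",
    "f": "(?:f|ph)",
}

def build_pattern(word):
    """Compile a case-insensitive, whole-word regex for `word` and its look-alikes."""
    body = "".join(SUBSTITUTES.get(ch, re.escape(ch)) for ch in word.lower())
    # Lookarounds instead of \b, so a leading "@" or "$" (non-word characters)
    # can still start a match.
    return re.compile(rf"(?<!\w){body}(?!\w)", re.IGNORECASE)

pattern = build_pattern("goose")  # harmless stand-in for a blocklist entry

print(bool(pattern.search("What a G00$E!")))   # True
print(bool(pattern.search("gooseberry jam")))  # False
```

In practice one pattern would be compiled per stored word and run against each submission. The wider the substitution table, the more innocent strings it will match, so the result is probably better used as a flag than as an automatic replacement.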
UnkwnTech
  • On't-day orget-day ig-pay atin-lay. Urse-cay ords-way are-ar ill-st ite-quay eadable-ray. (former owner of the AOL nick Itshay). – plinth May 12 '09 at 01:07
  • you mean "On't-day orget-fay" – Raiyan May 22 '15 at 16:28
  • This is a great reference, thank you for that. In application, however, I'm not sure how changing "hamburger" to "[h H][a A @][m M][b B I3 l3 i3][u U v V][r R][g G 6][e E 3][r R]" is going to help filter profanity. – JVE999 Feb 24 '20 at 17:21
  • @JVE999 Sometimes users attempt to bypass bad word filters by using other characters instead of the conventional letters; instead of A, one could use @ to say a bad word. However, it may also help by including lower- and upper-case characters. For example, you have a database of bad words, from which you pass this code on, and use it to detect a bad word even if it was misspelled or tweaked through these means. – LeWolfie Dec 12 '21 at 22:44

Shutterstock has a GitHub repo with a list of bad words used for filtering.

You can check it out here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words
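
Whatever list is used, it still has to be loaded into the filter (or the DB the question mentions). A minimal Python sketch, assuming the lists have been saved locally as plain-text files with one entry per line; `parse_word_list`, `load_word_lists` and the file names are illustrative assumptions, not something the repo provides:

```python
from pathlib import Path

def parse_word_list(lines):
    """Turn raw lines into a set of entries: trimmed, lower-cased, blanks dropped."""
    return {line.strip().lower() for line in lines if line.strip()}

def load_word_lists(paths):
    """Merge several one-entry-per-line files (e.g. one per language) into one set."""
    merged = set()
    for path in paths:
        text = Path(path).read_text(encoding="utf-8")
        merged |= parse_word_list(text.splitlines())
    return merged

# Placeholder entries standing in for real list contents.
sample = ["Darn\n", "\n", "  blasted fool \n", "darn\n"]
print(sorted(parse_word_list(sample)))  # ['blasted fool', 'darn']
```

A call such as `load_word_lists(["en.txt", "it.txt"])` (placeholder file names) would then yield a single set covering English and Italian. Each line is kept whole, so multi-word entries survive as phrases.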

David Fraga

If anyone needs an API, Google currently provides a bad-word indicator.

http://www.wdyl.com/profanity?q=naughtyword

{
response: "false"
}

Update: Google has now removed this service.

Tony
  • Doesn't seem to be active anymore. – Nick Jun 14 '16 at 12:36
  • Seeing as that list is down, https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt is an option. – Tony Sep 11 '19 at 12:40

I would say to just remove posts as you become aware of them, and block users who are overly explicit with their postings. You can say very offensive things without using any swear words. If you block the word ass (aka donkey), then people will just type a$$ or /\55, or whatever else they need to type to get past the filter.

Kibbee

+1 on the Clbuttic mistake. I think it is important for "bad word" filters to scan for both leading and trailing spaces (e.g., " ass ") as opposed to just the exact string, so that we won't have words like clbuttic, clbuttes, buttert, buttess, etc.
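
A minimal Python sketch comparing the space-padded check with a regex word boundary (`\b`), which also treats punctuation and the start or end of the text as word edges. The function names and the one-word list are assumptions for illustration:

```python
import re

BLOCKLIST = ["ass"]  # the example word from this answer

def padded_match(text):
    """Space-padded check: finds the word only when a space sits on each side."""
    return any(f" {word} " in text.lower() for word in BLOCKLIST)

# \b matches between a word character and a non-word character (or a string edge).
BOUNDARY_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in BLOCKLIST) + r")\b",
    re.IGNORECASE,
)

def boundary_match(text):
    """Word-boundary check: also catches the word next to punctuation or text edges."""
    return BOUNDARY_RE.search(text) is not None

for sample in ["a classic mistake", "what an ass you are", "Ass!"]:
    print(sample, padded_match(sample), boundary_match(sample))
# a classic mistake False False
# what an ass you are True True
# Ass! False True
```

Both checks leave "classic" alone, but only the boundary version catches the word at the start of a paragraph or beside punctuation.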

Jon Limjap
  • And don't block the town of Scunthorpe. – TRiG Nov 25 '09 at 19:01
  • Unfortunately, that doesn't get rid of curses at the beginning of a paragraph or near punctuation. If I had a paragraph that consisted of "(badword)!", it would fail your test. – proudgeekdad Feb 22 '11 at 21:51

Wikipedia's ClueBot has a bad-word filter; read its source.

http://en.wikipedia.org/wiki/User:ClueBot/Source#Score_list

Ming-Tang

You could always convince the client to hold a session where users just constantly post expletives, and build an easy way to add them to the system. It is a lot of work, but it will probably be more representative of the community.

Ross

In researching this topic I determined that what was needed was more than just a list that does arbitrary replacements. I have built a web service that allows you to identify the level of 'cleanliness' you desire. It also makes an effort to identify false positives - i.e. where a word may be bad in one context but not in others. Take a look at http://filterlanguage.com

Richard