"bad words" filter

Question

Not very technical, but... I have to implement a bad-words filter for a new site we are developing, so I need a "good" bad-words list to feed my db with. Any hints or directions? Looking around with Google I found this one, and it's a start, but nothing more.

Yes, I know that this kind of filter is easily evaded... but the client's will is the client's will!!! :-)

The site will have to filter out both English and Italian words, but for Italian I can ask my colleagues to help me with a community-built list of "parolacce" :-) - an email will do.

Thanks for any help.

Obscenity filtering... a bad idea or a really intercoursing bad idea? — stephenbayer, Oct 22 '08 at 13:18
Team it up with a spellchecker: if you get more spelling errors post-censorship, you've messed up somewhere and can deal with it. — nailitdown, Sep 02 '10 at 04:28
related: http://programmers.stackexchange.com/questions/143405/how-to-generate-language-safe-uuids — David Cary, Jun 28 '12 at 03:21
Very few filters can detect the words "Shiτ" and "fucκ", though. Not even StackOverflow. — Theodore R. Smith, Aug 02 '13 at 20:25
To everyone saying that this is pointless and/or stupid, consider that this kind of filtering could still be useful as one part of a larger system. Yes, it's probably a bad idea to find/replace or automatically reject based purely on a blacklist, but a filter could be used, for example, to send user-submitted content for manual approval/moderation. Or perhaps it could be used to warn a user before submission that they may be banned if they post offensive material. — Cam Jackson, Aug 09 '13 at 08:28
This is great for web-based educational software to flag student responses that "contain profanity", which can then be relayed to the teachers for review. I created an ASCII folding map, in which I hand-mapped all 65,000+ Unicode code points to their closest visual ASCII equivalent if one exists. I then did the same for all permutations of 2, 3, and 4-character sequences using a visual similarity engine, to collapse them to their nearest single-character equivalent (e.g. "\/\/" = "W", "|-|" = "H", "|_" = "L"), and then used a hierarchical temporal memory algorithm to recognize them instantly. — Triynko, Feb 10 '15 at 20:14
After much munging and collection: https://github.com/alvations/expletives/tree/master — alvas, Apr 16 '17 at 06:50
Hi @triynko If you are willing to share code I would be interested. Interesting idea. — Robert Lugg, Aug 31 '19 at 17:31
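
The look-alike characters raised in these comments can be partly handled by normalizing text before any word matching. What follows is a minimal Python sketch of that idea, not Triynko's implementation: Unicode NFKD decomposition removes accents, and two tiny hand-written tables cover a few Greek letters and ASCII-art sequences. The names `HOMOGLYPHS`, `MULTI_CHAR` and `fold_to_ascii`, and the table contents, are illustrative assumptions and nowhere near a complete map.

```python
import unicodedata

# Illustrative, deliberately tiny tables; a real map would be far larger.
HOMOGLYPHS = {"τ": "t", "κ": "k", "ο": "o", "α": "a"}
MULTI_CHAR = {"\\/\\/": "w", "|-|": "h", "|_": "l"}


def fold_to_ascii(text: str) -> str:
    """Lower-case text and collapse some look-alikes to plain ASCII letters."""
    # Replace multi-character sequences first, longest sequence first.
    for seq in sorted(MULTI_CHAR, key=len, reverse=True):
        text = text.replace(seq, MULTI_CHAR[seq])
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            continue  # drop the accent itself
        ch = ch.lower()
        folded.append(HOMOGLYPHS.get(ch, ch))
    return "".join(folded)
```

The folded string, rather than the raw input, would then be handed to whatever word matching the filter uses.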

score 60 · Answer 1 · edited Apr 05 '18 at 20:30

Beware of clbuttic mistakes.

"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"

Hmm. "clbuttic".

Google "clbuttic" - thousands of hits!

There's someone who calls his car 'clbuttic'.

There are "Clbuttic Steam Engine" message boards.

Webster's dictionary - no help.

Hmm. What can this be?

HINT: People who make buttumptions about their regex scripts, will be embarbutted when they repeat this mbuttive mistake.
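
The mistake is easy to reproduce. The Python sketch below assumes the filter replaced "ass" with "butt" (which is what "clbuttic" implies) and contrasts a blind substring replacement with one that only fires on whole words. The function names are made up for illustration.

```python
import re

REPLACEMENTS = {"ass": "butt"}  # the substitution behind "clbuttic"


def naive_censor(text: str) -> str:
    """Blind substring replacement: rewrites innocent words too."""
    for bad, clean in REPLACEMENTS.items():
        text = text.replace(bad, clean)
    return text


def whole_word_censor(text: str) -> str:
    """Replace only when the term stands alone as a word."""
    for bad, clean in REPLACEMENTS.items():
        text = re.sub(rf"\b{re.escape(bad)}\b", clean, text, flags=re.IGNORECASE)
    return text


print(naive_censor("a classic mistake"))       # a clbuttic mistake
print(whole_word_censor("a classic mistake"))  # a classic mistake
```

The `\b` anchors match at the boundary between a word character and a non-word character, so "classic" and "assumptions" are left alone while a standalone occurrence is still replaced.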

UnkwnTech · Accepted Answer · 2011-07-24T17:41:18.370

I didn't see any language specified, but you can use this for PHP. It generates a regex for each word entered, so that even intentional misspellings (e.g. @ss, i3itch) are caught.

<?php

/**
 * @author unkwntech@unkwndesign.com
 **/

// Regex fragment for each letter: both cases plus common look-alikes.
// Multi-character substitutions (I3, ph, vv) have to be alternations;
// inside [...] they would be read as separate single characters.
$replace = array(
    'a' => '[aA@]',
    'b' => '(?:[bB]|[Ili]3)',
    'c' => '[cC(kK]',
    'd' => '[dD]',
    'e' => '[eE3]',
    'f' => '(?:[fF]|[pP][hH])',
    'g' => '[gG6]',
    'h' => '[hH]',
    'i' => '[iIl!1]',
    'j' => '[jJ]',
    'k' => '[cC(kK]',
    'l' => '[lL1!i]',
    'm' => '[mM]',
    'n' => '[nN]',
    'o' => '[oO0]',
    'p' => '[pP]',
    'q' => '[qQ9]',
    'r' => '[rR]',
    's' => '[sS$5]',
    't' => '[tT7]',
    'u' => '[uUvV]',
    'v' => '[vVuU]',
    'w' => '(?:[wW]|[vV]{2})',
    'x' => '[xX]',
    'y' => '[yY]',
    'z' => '[zZ2]',
);

$act = isset($_GET['act']) ? $_GET['act'] : '';

if($act == 'do')
 {
    $input = isset($_POST['word']) ? $_POST['word'] : '';
    $regex = '';
    foreach(str_split(strtolower($input)) as $char)
     {
        // Letters expand to their class; digits, spaces and punctuation match literally.
        $regex .= isset($replace[$char]) ? $replace[$char] : preg_quote($char, '/');
     }
    //$regex = "/" . $regex . "/";
    // Escape for HTML, since the pattern is built from user input.
    echo htmlspecialchars($regex);
 }

if($act == 'list')
 {
    // The mysql_* functions were removed in PHP 7; mysqli replaces them.
    $link = mysqli_connect('localhost', 'username', 'password', 'peoples');
    $result = mysqli_query($link, "SELECT word FROM filters");
    while($row = mysqli_fetch_assoc($result))
     {
        echo htmlspecialchars($row['word']) . "<br />";
     }
     echo '<hr>';
 }
?>
<html>
    <head>
        <title>RegEx Generator</title>
    </head>
    <body>
        <form action='badword.php?act=do' method='post'>
            Word: <input type='text' name='word' /><br />
            <input type='submit' value='Generate' />
        </form>
        <a href="badword.php?act=list">List Words</a>
    </body>
</html>

On't-day orget-day ig-pay atin-lay. Urse-cay ords-way are-ar ill-st ite-quay eadable-ray. (former owner of the AOL nick Itshay). — plinth, May 12 '09 at 01:07
This is a great reference, thank you for that. In application, however, I'm not sure how changing "hamburger" to "[h H][a A @][m M][b B I3 l3 i3][u U v V][r R][g G 6][e E 3][r R]" is going to help filter profanity. — JVE999, Feb 24 '20 at 17:21
@JVE999 Sometimes users attempt to bypass bad word filters by using other characters instead of the conventional letters; instead of A, one could use @ to say a bad word. However, it may also help by including lower- and upper-case characters. For example, you have a database of bad words, from which you pass this code on, and use it to detect a bad word even if it was misspelled or tweaked through these means. — LeWolfie, Dec 12 '21 at 22:44
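
To make the exchange above concrete, here is a rough Python sketch of how generated patterns could be applied to text. It is modelled loosely on the answer's table and is not a port of it: `LOOKALIKES` lists only a few letters, and `word_to_pattern` and `build_filter` are hypothetical names introduced for this example.

```python
import re

# Per-letter alternatives; anything missing matches itself in either case.
LOOKALIKES = {
    "a": "[aA@]",
    "b": "(?:[bB]|[Ili]3)",
    "e": "[eE3]",
    "i": "[iIl!1]",
    "o": "[oO0]",
    "s": "[sS$5]",
    "t": "[tT7]",
}


def word_to_pattern(word: str) -> str:
    """Expand one list entry into a regex fragment."""
    parts = []
    for ch in word.lower():
        if ch in LOOKALIKES:
            parts.append(LOOKALIKES[ch])
        elif ch.isalpha():
            parts.append(f"[{ch}{ch.upper()}]")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def build_filter(words):
    """Combine the per-word fragments into one compiled regex."""
    alternation = "|".join(word_to_pattern(w) for w in words)
    # Lookarounds instead of \b, because look-alikes such as @ or $
    # are not word characters and \b would not anchor on them.
    return re.compile(rf"(?<![A-Za-z0-9])(?:{alternation})(?![A-Za-z0-9])")
```

Each word stored in the database would be expanded once, and the compiled pattern run against incoming posts. The lookarounds keep "classic" from matching, while "@ss" and "a$$" are caught.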

score 39 · Answer 3 · answered Mar 09 '12 at 05:28

Shutterstock has a GitHub repo with a list of bad words used for filtering.

You can check it out here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

answered by David Fraga

It's a bit much though - "Mr Hands" is offensive apparently. – UpTheCreek Sep 17 '15 at 18:08
The french DB is bad... – Cocorico Feb 05 '16 at 09:25
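
Loading such lists into a filter could look like the Python sketch below. It assumes each list is a plain UTF-8 text file with one term per line; check the actual files before relying on that. The `exclude` parameter is a hypothetical addition for dropping entries you consider too broad, such as the one UpTheCreek mentions.

```python
from pathlib import Path


def load_wordlist(path):
    """Read one term per line; skip blank lines and normalize case."""
    terms = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        term = line.strip().lower()
        if term:
            terms.add(term)
    return terms


def load_wordlists(paths, exclude=()):
    """Merge several per-language files into one set, minus exclusions."""
    merged = set()
    for path in paths:
        merged |= load_wordlist(path)
    return merged - {term.lower() for term in exclude}
```

For the site in the question, one English and one Italian file would be passed together, for example `load_wordlists(["en.txt", "it.txt"], exclude=["mr hands"])`, where both file names are placeholders.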

Tony · score 7 · Answer 4 · 2017-01-25T07:45:40.197

If anyone needs an API, Google currently provides a bad-word indicator.

http://www.wdyl.com/profanity?q=naughtyword

{
response: "false"
}

Update: Google has now removed this service.

answered Aug 03 '12 at 18:52 · edited Jan 25 '17 at 07:45

Doesn't seem to be active anymore. – Nick Jun 14 '16 at 12:36
Seeing as that list is down, https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt is an option. – Tony Sep 11 '19 at 12:40

score 4 · Answer 5 · answered Aug 24 '08 at 01:23

I would say to just remove posts as you become aware of them, and block users who are overly explicit with their postings. You can say very offensive things without using any swear words. If you block the word ass (aka donkey), then people will just type a$$ or /\55, or whatever else they need to type to get past the filter.

score 4 · Answer 6 · answered Aug 30 '08 at 08:21

+1 on the Clbuttic mistake. I think it is important for "bad word" filters to scan for both leading and trailing spaces (e.g., " ass ") as opposed to just the exact string, so that we won't have words like clbuttic, clbuttes, buttert, buttess, etc.

answered by Jon Limjap

And don't block the town of Scunthorpe. – TRiG Nov 25 '09 at 19:01
Unfortunately, that doesn't get rid of curses at the beginning of a paragraph or near punctuation. If I had a paragraph that consisted of "(badword)!", it would fail your test. – proudgeekdad Feb 22 '11 at 21:51
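
The gap proudgeekdad describes can be shown in a few lines. This Python sketch compares the space-padded check from the answer with a version that splits the text into word tokens first. `BLOCKED` is an illustrative one-entry list and both function names are invented for the example.

```python
import re

BLOCKED = {"ass"}  # illustrative; a real list would come from the database


def space_padded_hit(text: str) -> bool:
    """The answer's approach: the term with a space on each side."""
    lowered = text.lower()
    return any(f" {term} " in lowered for term in BLOCKED)


def token_hit(text: str) -> bool:
    """Split into word tokens so position and punctuation do not matter."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return any(token in BLOCKED for token in tokens)
```

With the padded check, "(ass)!" and a term at the very start of a paragraph slip through; the token check catches both. Because whole tokens are compared, a place name like Scunthorpe is only flagged if the entire name is on the list.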

score 2 · Answer 7 · answered Sep 02 '10 at 04:29

Wikipedia's ClueBot has a bad-word filter; read its source.

http://en.wikipedia.org/wiki/User:ClueBot/Source#Score_list
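
A score list, as the link's anchor suggests, weighs patterns rather than banning single words outright. The Python sketch below illustrates that general idea only; it is not ClueBot's code, and the patterns, weights and `REVIEW_THRESHOLD` are invented for the example.

```python
import re

# Hypothetical patterns and weights, chosen only for illustration.
SCORE_LIST = [
    (re.compile(r"\bidiot\b", re.IGNORECASE), 5),
    (re.compile(r"\bstupid\b", re.IGNORECASE), 3),
    (re.compile(r"!{3,}"), 1),
]
REVIEW_THRESHOLD = 5  # assumed cut-off


def score(text: str) -> int:
    """Sum weight times number of matches over every pattern."""
    return sum(weight * len(pattern.findall(text)) for pattern, weight in SCORE_LIST)


def needs_review(text: str) -> bool:
    """Flag for a human moderator instead of rejecting outright."""
    return score(text) >= REVIEW_THRESHOLD
```

A single mild hit stays under the threshold, while several hits together push a post into the moderation queue, which fits the manual-review approach suggested in the comments on the question.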

answered by Ming-Tang

score 1 · Answer 8 · answered Aug 23 '08 at 22:03

You could always convince the client to hold a session where users just constantly post expletives, and build an easy way to add them to the system. It is a lot of work, but it will probably be more representative of the community.

answered by Ross

score -2 · Answer 9 · answered Sep 02 '10 at 04:23

In researching this topic I determined that what was needed was more than just a list that does arbitrary replacements. I have built a web service that allows you to identify the level of 'cleanliness' you desire. It also makes an effort to identify false positives - i.e. where a word may be bad in one context but not in others. Take a look at http://filterlanguage.com

answered by Richard

The URL was unreachable. – Lenin Dec 13 '12 at 11:44