
Possible Duplicate:
“bad words” filter

In my web application I have a section that takes input from the user and posts it on the main page.

I would like to prevent posts that contain dirty language.

Is there any research on this, or a PHP library that detects most curses and dirty expressions in English?

In short, I would like to test the input like this:

if the input is in the set of unwanted patterns
      don't publish it
else
      publish it on the main wall
0x90
  • related: [Scunthorpe problem](http://en.wikipedia.org/wiki/Scunthorpe_problem) – Gumbo Dec 31 '11 at 09:58
  • It is not an exact duplicate; here he also wants to filter expressions (n-grams). I think this topic should not be closed. – JohnJohnGa Dec 31 '11 at 10:13

2 Answers


Honestly? There's no reliable way to programmatically censor a post. If someone from Scunthorpe were to post about their recent trip to the town of Effin, and how much they love listening to the music of Jarvis Cocker whilst giving their Shitzu a groom, that would probably trigger any swear filter you implement. What's more, any word you leave off your list will get through.

You could use some sort of filter to flag posts for review by a human moderator, but depending on an entirely automated process isn't going to work.
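As a rough illustration of flagging instead of blocking, here is a minimal sketch. It is written in Python rather than PHP purely for brevity; the word list, the function names, and the "review"/"publish" labels are all hypothetical. It compares naive substring matching with whole-word matching on the example sentence above.

```python
import re

# Hypothetical, deliberately tiny word list taken from the examples above.
BLOCKED_WORDS = {"cock", "effin", "shit"}

def substring_hits(text):
    """Naive check: a blocked word counts even when it sits inside another word."""
    lowered = text.lower()
    return sorted(w for w in BLOCKED_WORDS if w in lowered)

def whole_word_hits(text):
    """Stricter check: only tokens that exactly equal a blocked word count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted(set(tokens) & BLOCKED_WORDS)

def triage(text):
    """Send matching posts to a human moderator instead of rejecting them."""
    return "review" if whole_word_hits(text) else "publish"

post = "I visited Effin with my Shitzu while listening to Jarvis Cocker"
print(substring_hits(post))   # matches inside "Shitzu" and "Cocker" as well
print(whole_word_hits(post))  # still matches the town name "Effin"
print(triage(post))
```

Whole-word matching removes the "Shitzu" and "Cocker" false positives, but the town name is still caught and deliberate misspellings are not, which is why the result here is only a flag for a moderator.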

GordonM
  • I totally disagree. It is an information retrieval problem. Google and Yahoo have dealt with this kind of issue for many years, and you can use probabilities over single words or n-grams to solve it. – JohnJohnGa Dec 31 '11 at 10:10
  • Problems of this nature still arise anyway, so it's obviously not foolproof. Therefore I stand by my original answer. This month alone has seen two stories in the news about automated filtering causing problems (A woman couldn't set her home town to Effin on Facebook, and the Programme Guide on Virgin cable TV started censoring show names such as Never Mind the Buzzcocks) – GordonM Dec 31 '11 at 10:13

It must be based on a dictionary. First you will need a static list of dirty words. Then, by finding the top collocations of each dirty word, you can discover the possible dirty expressions, but you will need a large set of documents for that.
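A minimal sketch of the collocation step, in Python rather than PHP for brevity. The function name, the seed word, and the three-document corpus are hypothetical stand-ins. It uses raw bigram counts; a real system would more likely score pairs with something like pointwise mutual information so that common words such as "a" do not dominate.

```python
from collections import Counter
import re

def top_collocations(documents, seed_words, top_n=5):
    """Count adjacent word pairs that contain a seed word; return the most frequent."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        # Walk over every pair of neighbouring tokens (bigrams).
        for left, right in zip(tokens, tokens[1:]):
            if left in seed_words or right in seed_words:
                counts[(left, right)] += 1
    return counts.most_common(top_n)

# Toy corpus with a mild placeholder seed word.
docs = ["you blasted fool", "what a blasted fool", "blasted weather"]
print(top_collocations(docs, {"blasted"}, top_n=1))
```

The pairs that come out on top are candidates for the expression list, and they still need a human to confirm which ones are actually offensive.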

JohnJohnGa