We want to create an API that takes an input string and identifies whether it contains an abusive word. I have read about profanity filters but haven't found a satisfactory solution. There are a couple of challenges:
- Obfuscation: the word "SUCK", which is considered abusive, can be written as SUUCK, SUCCK, SU C K, or in many other ways. The letters may be separated by special characters, or a deliberate misspelling may be used that still sounds like the original word.
- Multi-lingual: abusive words could be written in any language.
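One idea I've considered for the obfuscation case is normalizing the input before checking it: lowercase, strip non-letter characters, and collapse repeated letters. A minimal sketch (the blocklist here is a hypothetical placeholder; a real system would load a maintained word list):

```python
import re

# Hypothetical blocklist for illustration only.
BLOCKLIST = {"suck"}

def normalize(text: str) -> str:
    """Lowercase, drop non-letters, and collapse runs of the same letter."""
    letters = re.sub(r"[^a-z]", "", text.lower())  # "SU C K" -> "suck"
    return re.sub(r"(.)\1+", r"\1", letters)       # "suuck"  -> "suck"

def is_abusive(text: str) -> bool:
    return normalize(text) in BLOCKLIST
```

Collapsing repeated letters can cause false positives on legitimate words (e.g. "assess" loses its doubled letters), so a production filter would probably check both the collapsed and uncollapsed forms.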
How can we identify this? I read *Comparing strings with tolerance* to get an idea of how strings can be compared based on their similarity.
This must be a concern for many organizations, especially those running chat services, so there should be established ways to identify such language. Can anyone point me to references for this? And how can we block words that sound similar to an abusive word, or that differ from one by only 1 or 2 characters?
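For the "similar sounding" part specifically, a classic technique is a phonetic algorithm such as Soundex (Metaphone and Double Metaphone are more accurate successors): words are reduced to a phonetic code, and two words match if their codes match. A minimal Soundex sketch, again against a hypothetical blocklist:

```python
def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero-padded."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so repeats across vowels count
    return (word[0].upper() + "".join(out) + "000")[:4]

BLOCKLIST = {"suck"}  # hypothetical word list
BLOCK_CODES = {soundex(w) for w in BLOCKLIST}

def sounds_abusive(word: str) -> bool:
    return soundex(word) in BLOCK_CODES
```

Soundex is English-centric, which is why the multi-lingual requirement probably needs per-language phonetic rules or a library that supports them.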