7

I need to analyze a text to check whether it contains banned words. Suppose the blacklist contains the word "forbid". The word has many forms; in the text it may appear as, for example, "forbidding", "forbidden", or "forbad". To reduce a word to its base form, I use lemmatization. Do you have any other suggestions?

What about typos?
For example: "F0rb1d". I am thinking of using Damerau–Levenshtein distance or something similar. Any suggestions?

And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)

How can I solve this problem?
A fast algorithm is preferable, because the text is processed in real time.
Any tips for improving performance (how to store the data, etc.) would also be welcome.

Alexei - check Codidact
user348173
  • Not exact duplicates, but similar [ques](http://stackoverflow.com/questions/246961/algorithm-to-find-similar-text) [tions](http://stackoverflow.com/questions/4067105/detect-duplicated-similar-text-among-large-datasets). – khachik Apr 03 '11 at 15:34

2 Answers

3

As far as I know, there are two possible approaches.

You could try dynamic programming with LCS (longest common subsequence). It searches the original text for the desired word as a pattern; I believe it's O(mn), where m is the pattern length and n is the text length:

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html
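To make the LCS idea concrete, here is a minimal sketch in Python. The function names, the per-token comparison, and the 0.8 threshold are all assumptions made for illustration, not something the question specifies. Comparing token by token is a deliberate choice: a subsequence of a long text can be scattered across unrelated words, so running LCS against the whole text would produce many false hits.

```python
def lcs_length(pattern: str, text: str) -> int:
    """Length of the longest common subsequence, O(len(pattern) * len(text))."""
    prev = [0] * (len(text) + 1)  # DP row for the previous pattern character
    for p in pattern:
        curr = [0]
        for j, t in enumerate(text, start=1):
            if p == t:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def looks_like(banned: str, token: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: flag a token if most of the banned word survives in it.

    The threshold is an arbitrary illustrative value and would need tuning.
    """
    if not banned:
        return False
    banned, token = banned.lower(), token.lower()
    return lcs_length(banned, token) / len(banned) >= threshold
```

With these assumptions, "forbad" keeps five of the six letters of "forbid" and is flagged, while "f0rb1d" keeps only four and is not, so digit substitutions would still need the character grouping described next.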

An easier option would be a text search algorithm. The best one I know is KMP, which is O(n) in the text length once the pattern has been preprocessed (O(n + m) overall). For character comparison you could group characters into sets such as {i, I, l, L, 1}, {o, O, 0}, and so on. You could also modify it to tolerate some mismatched letters (forbid -> forbad).

http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm
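A rough sketch of KMP combined with character groups could look like the following. Rather than changing the comparison inside KMP, it maps every character to one representative of its group first, which keeps the comparison a plain equality test so the failure table stays valid. The mapping table and function names are assumptions for illustration, and the sketch assumes mostly ASCII input so that lowercasing does not change string length. It does not handle the forbid -> forbad case.

```python
# Map each look-alike character to one representative of its group.
# These groups are illustrative, not a complete list.
CANONICAL = str.maketrans({"0": "o", "1": "i", "l": "i"})


def normalize(s: str) -> str:
    return s.lower().translate(CANONICAL)


def failure_table(pattern: str) -> list[int]:
    """table[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_find_all(text: str, pattern: str) -> list[int]:
    """Return start offsets of every (possibly overlapping) match after normalization."""
    text, pattern = normalize(text), normalize(pattern)
    if not pattern:
        return []
    table = failure_table(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = table[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = table[k - 1]
    return hits
```

Because the search never looks at word boundaries, it also works on the run-together example: `kmp_find_all("F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany.", "forbid")` finds the match at offset 0.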

Now you can compare the benefits of these two approaches with your own suggestion.

1

You could also use regex matching to check for words. http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx
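
As a small sketch of the regex idea (written in Python here instead of the C# from the linked article), you could generate a pattern from each banned word and put a character class wherever a look-alike substitution is likely. The `LOOKALIKES` table and the function name are hypothetical and only cover the substitutions from the question.

```python
import re

# Illustrative substitutions only; extend as needed.
LOOKALIKES = {"i": "[i1l]", "o": "[o0]"}


def banned_word_regex(word: str) -> re.Pattern:
    """Build a case-insensitive pattern that tolerates look-alike characters."""
    parts = [LOOKALIKES.get(ch, re.escape(ch)) for ch in word.lower()]
    return re.compile("".join(parts), re.IGNORECASE)


pattern = banned_word_regex("forbid")
match = pattern.search("F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany.")
```

The pattern has no word-boundary anchors, so it also matches inside text written without whitespace; the trade-off is that it can hit harmless words that merely contain the banned one.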

kojoma