Analyze text (lemmatization, edit distance)

Question

I need to analyze the text to exist in it banned words. Suppose the black list is the word: "Forbid". The word has many forms. In the text the word can be, for example: "forbidding", "forbidden", "forbad". To bring the word to the initial form, I use a process lemmatization. Your suggestions?

What about typos?
For example: "F0rb1d". I think use damerau–Levenshtein or another. You suggestions?

And what if the text is written as follows:
"ForbiddenInformation.Privatecorrespondenceofthecompany." OR "F0rb1dden1nformation.Privatecorresp0ndenceofthec0mpany." (yes, without whitespace)

How to solve this problem?
Preferably fast algorithm, because text are processed in real time.
And maybe what some tips to improve performance (how to store, etc)?

Not exact duplicates, but similar [ques](http://stackoverflow.com/questions/246961/algorithm-to-find-similar-text) [tions](http://stackoverflow.com/questions/4067105/detect-duplicated-similar-text-among-large-datasets). — khachik, Apr 03 '11 at 15:34

Aurimas Neverauskas · Accepted Answer · 2011-07-15T13:20:09.327

there're two possible solutions as far as I know algorithms.

You could try to use dynamic programming , LCS (longest common subsequence). It will search original text for the desired word as pattern, I believe it's O(mn):

http://en.wikipedia.org/wiki/Longest_common_subsequence_problem http://www.ics.uci.edu/~eppstein/161/960229.html

Although the easier would be to use text search algorithm. The best I know is KMP and it's O(n). For character comparison you could group them into sets like {i I l(L) 1}, {o O 0} and so on. Yet you could modify this for not matching all letters (forbid -> forbad).

http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm

So now you could compare benefits of these two and yours suggestion.

score 1 · Answer 2 · answered Apr 05 '11 at 09:54

1

You could also use RegEx Matches to check for words. http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx

answered Apr 05 '11 at 09:54

kojoma

313
2
3
12

Analyze text (lemmatization, edit distance)

2 Answers2