I have string patterns ('rules') grouped into 'categories', e.g.:
Category1
- lorem ipsum dolor sit amet
- consectetur adipiscing elit
- fusce sit amet ante nisi
- lorem ut sem interdum molestie
- suspendisse non lorem ut sem interdum molestie
Category2
- vivamus porta non metus egestas finibus
- nam convallis augue nec laoreet pretium
- turpis velit cursus enim ac suscipit risus turpis in metus
Now, I want to be able to 'categorize' a string based on those rules. Let's say we want to find out which category the string "fusce laoreet amet ante nisi" belongs to. My current implementation uses Levenshtein distance and finds that the string most closely resembles "fusce sit amet ante nisi", and hence the category is Category1.
Let's say we want to categorize "vivamus vel lorem imperdiet sit". Because I put a threshold of 1/5th of the string length on the Levenshtein distance (i.e. the string must be at least 80% similar to its match), the string will remain 'uncategorized'.
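A minimal sketch of this first step, to make the threshold concrete. The names `CATEGORIES`, `levenshtein` and `closest_rule` are made up for illustration, and the threshold is read here as "distance must not exceed 1/5th of the length of the string being categorized", without rounding; my real code may differ in those details.

```python
from typing import Optional, Tuple

# Rules grouped by category, copied from the example above.
CATEGORIES = {
    "Category1": [
        "lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        "fusce sit amet ante nisi",
        "lorem ut sem interdum molestie",
        "suspendisse non lorem ut sem interdum molestie",
    ],
    "Category2": [
        "vivamus porta non metus egestas finibus",
        "nam convallis augue nec laoreet pretium",
        "turpis velit cursus enim ac suscipit risus turpis in metus",
    ],
}


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


def closest_rule(text: str, max_ratio: float = 0.2) -> Optional[Tuple[str, str, int]]:
    """Return (category, rule, distance) for the nearest rule, or None if
    even the nearest rule needs more than max_ratio * len(text) edits.
    The threshold definition is an assumption (see the text above)."""
    best = None
    for category, rules in CATEGORIES.items():
        for rule in rules:
            distance = levenshtein(text, rule)
            if best is None or distance < best[2]:
                best = (category, rule, distance)
    if best is not None and best[2] <= max_ratio * len(text):
        return best
    return None
```

With this reading, `closest_rule("vivamus vel lorem imperdiet sit")` returns `None`, so the string stays uncategorized. The earlier example is a borderline case: replacing "sit" with "laoreet" costs 6 edits, while 1/5th of the 28 characters of "fusce laoreet amet ante nisi" is 5.6, so whether it matches depends on how the allowance is rounded.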
In such a case I would continue with the following algorithm:
From each category, I extract the 'common words', i.e. words that occur more than once across the rules of the category. In a way, those are the dominating words in the category. So, we'll have:
Category1
- lorem: 3
- sit: 2
- amet: 2
- sem: 2
- interdum: 2
- molestie: 2
Category2
- metus: 2
- turpis: 2
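The extraction above can be sketched roughly as follows. `dominant_words`, `min_count` and `min_length` are hypothetical names. Note that "ut" also occurs twice in Category1 but is not in the list above, so this sketch assumes that words shorter than three characters are ignored; that assumption is only there to reproduce the list. Counts are total occurrences in the category, which is why "turpis" scores 2 even though both occurrences are in the same rule.

```python
from collections import Counter

# Rules grouped by category, copied from the example above.
CATEGORIES = {
    "Category1": [
        "lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        "fusce sit amet ante nisi",
        "lorem ut sem interdum molestie",
        "suspendisse non lorem ut sem interdum molestie",
    ],
    "Category2": [
        "vivamus porta non metus egestas finibus",
        "nam convallis augue nec laoreet pretium",
        "turpis velit cursus enim ac suscipit risus turpis in metus",
    ],
}


def dominant_words(rules, min_count=2, min_length=3):
    """Count every word occurrence across the rules of one category and
    keep the words seen at least min_count times.
    min_length=3 is an assumption made to drop short words such as 'ut'."""
    counts = Counter(word for rule in rules for word in rule.split())
    return {
        word: n
        for word, n in counts.items()
        if n >= min_count and len(word) >= min_length
    }


for category, rules in CATEGORIES.items():
    print(category, dominant_words(rules))
```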
Now I split the string "vivamus vel lorem imperdiet sit" word by word and give each category a value, depending on how many of the string's words are present among the category's 'dominating words'. That is:
Category1 gets a value of 5 = 3 (lorem) + 2 (sit), and Category2 gets a value of 0 (no matches between the words of the string I am categorizing and the dominating words in the category). The highest-value category 'wins'.
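A small sketch of this scoring step, using the dominating-word counts listed earlier as a literal table. `DOMINANT_WORDS`, `score_categories` and `best_guess` are illustrative names; returning `None` when every category scores 0, and letting the first category win a tie, are assumptions, since I have not decided how to handle those cases.

```python
# Dominating words per category, as listed above.
DOMINANT_WORDS = {
    "Category1": {"lorem": 3, "sit": 2, "amet": 2, "sem": 2,
                  "interdum": 2, "molestie": 2},
    "Category2": {"metus": 2, "turpis": 2},
}


def score_categories(text):
    """Sum, per category, the weight of every word of text that is one
    of the category's dominating words (unknown words contribute 0)."""
    words = text.split()
    return {
        category: sum(weights.get(word, 0) for word in words)
        for category, weights in DOMINANT_WORDS.items()
    }


def best_guess(text):
    """Highest-scoring category, or None if nothing matched at all.
    Ties go to the category listed first (an assumption)."""
    scores = score_categories(text)
    category, score = max(scores.items(), key=lambda item: item[1])
    return category if score > 0 else None


print(score_categories("vivamus vel lorem imperdiet sit"))
# {'Category1': 5, 'Category2': 0}
print(best_guess("vivamus vel lorem imperdiet sit"))
# Category1
```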
In short, my algorithm is:
- Use Levenshtein distance, allowing up to 1/5th of the string to change, to find the closest matching rule.
- If that fails, split the string we are categorizing into words and, for each word, check how 'dominating' it is in each category, summing these into a value per category. The highest-value category is our best guess.
Is there a better way to do this? Do you see a problem with this algorithm? Any suggestions?