The problem I'm trying to solve: I have a million words (multiple languages) and some classes that they classify into as my training corpora. Given the testing corpora of words (which is bound to increase in number over time) I want to get the closest match of each of those words in the training corpora and hence classify that word as the corresponding class of its closest match.
My Solution: Initially, I did this brute force which doesn't scale. Now I'm thinking I build a suffix tree over the concatenation of the training corpora (O(n)) and query the testing corpora (constant time). Trying to do this in python.
I'm looking for tools or packages that get me started or for other more efficient ways to solve the problem at hand. Thanks in advance.
Edit 1: As for how I am finding the closest match, I was thinking a combination of exact match alignment (from the suffix tree) and then for the part of the input string that is left over, I thought of doing a local alignment with affine gap penalty functions.