
The problem I'm trying to solve: I have a million words (in multiple languages), each labelled with a class, as my training corpus. Given a testing corpus of words (which is bound to grow over time), I want to find the closest match of each of those words in the training corpus and classify each word with the class of its closest match.

My Solution: Initially I did this by brute force, which doesn't scale. Now I'm thinking of building a suffix tree over the concatenation of the training corpus (O(n) construction) and querying it with the testing words (in time proportional to the length of each query). I'm trying to do this in Python.
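To make the idea concrete, here is a rough sketch of the exact-match stage I have in mind. It uses a naive suffix trie (quadratic to build) purely for illustration; a proper suffix tree (e.g. Ukkonen's algorithm or a library implementation) would replace `build_suffix_trie`. The training words and classes here are made up.

```python
# Illustrative only: a naive suffix *trie* standing in for the suffix tree.
# Training data is assumed to be a dict mapping each training word to its class.

def build_suffix_trie(words):
    """Index every suffix of every training word. Each node maps a character
    to a child node; the '$' key holds the set of words containing the
    substring spelled out by the path to that node."""
    root = {}
    for word in words:
        for start in range(len(word)):
            node = root
            for ch in word[start:]:
                node = node.setdefault(ch, {})
                node.setdefault('$', set()).add(word)
    return root

def longest_exact_match(trie, query):
    """Return the longest substring of `query` that occurs in some training
    word, together with the training words containing it."""
    best, best_words = "", set()
    for start in range(len(query)):
        node, matched = trie, ""
        for ch in query[start:]:
            if ch not in node:
                break
            node = node[ch]
            matched += ch
        if len(matched) > len(best):
            best, best_words = matched, node.get('$', set())
    return best, best_words

if __name__ == "__main__":
    training = {"bonjour": "greeting", "merci": "thanks", "hello": "greeting"}
    trie = build_suffix_trie(training)
    match, candidates = longest_exact_match(trie, "bonjourno")
    print(match, {w: training[w] for w in candidates})
```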

I'm looking for tools or packages to get me started, or for other, more efficient ways to solve the problem at hand. Thanks in advance.

Edit 1: As for how I'm finding the closest match, I was thinking of a combination of exact-match alignment (from the suffix tree) and then, for the part of the input string that is left over, a local alignment with an affine gap penalty function.
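Roughly what I mean by the second step, assuming Smith-Waterman-style local alignment with affine gaps (Gotoh's algorithm); the scoring parameters are placeholders I would still have to tune:

```python
# Sketch of local alignment with affine gap penalties (score only).
NEG_INF = float("-inf")

def local_affine_score(a, b, match=2, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Best local alignment score between strings a and b under affine gaps."""
    n, m = len(a), len(b)
    # M: alignments ending with a[i-1] aligned to b[j-1]
    # X: alignments ending with a[i-1] aligned to a gap
    # Y: alignments ending with b[j-1] aligned to a gap
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(0.0, s + max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]))
            X[i][j] = max(M[i-1][j] + gap_open + gap_extend, X[i-1][j] + gap_extend)
            Y[i][j] = max(M[i][j-1] + gap_open + gap_extend, Y[i][j-1] + gap_extend)
            best = max(best, M[i][j], X[i][j], Y[i][j])
    return best

print(local_affine_score("ourno", "journal"))  # scores the leftover piece
```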

user281989
1 Answer


What distance metric are you using for the closest match?

There are papers that cover how to do an edit-distance search using a suffix tree. For each suffix there is an extension of the edit matrix, and these extensions can be ordered so as to allow a ranked search of the suffix tree, returning the matching items in order of increasing distance.

An example of this is Top-k String Similarity Search with Edit-Distance Constraints (2013), https://doi.org/10.1109/ICDE.2013.6544886 (Google Scholar: https://scholar.google.com/scholar?cluster=13387662751776693983). The solution presented avoids computing all the entries in the table as columns are added.
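To illustrate the general idea (this is not the algorithm from the paper, just the classic bounded edit-distance search over a plain trie): as you walk down the tree, the dynamic-programming table grows by one row per character, and a whole subtree can be pruned as soon as every entry in the current row exceeds the distance bound.

```python
# Bounded edit-distance search over a trie of training words.
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w                      # mark the end of a complete word
    return root

def search(trie, query, max_dist):
    """Return {word: edit_distance} for training words within max_dist of query."""
    results = {}
    first_row = list(range(len(query) + 1))

    def recurse(node, ch, prev_row):
        # Extend the DP table by one row for the edge character `ch`.
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            insert_cost = row[j - 1] + 1
            delete_cost = prev_row[j] + 1
            replace_cost = prev_row[j - 1] + (query[j - 1] != ch)
            row.append(min(insert_cost, delete_cost, replace_cost))
        if '$' in node and row[-1] <= max_dist:
            results[node['$']] = row[-1]
        if min(row) <= max_dist:           # otherwise prune the whole subtree
            for child_ch, child in node.items():
                if child_ch != '$':
                    recurse(child, child_ch, row)

    for ch, child in trie.items():
        if ch != '$':
            recurse(child, ch, first_row)
    return results

training = ["bonjour", "merci", "hello", "hallo"]
print(search(build_trie(training), "helo", max_dist=1))
```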

In your problem it seems that for each word there are classes that apply to it. If the classes don't depend on context, then the above would work and a word-to-class map would be all that is needed. But if they do depend on context, then this seems closer to part-of-speech tagging.

Dan D.
  • I was thinking exact-match alignment score + local alignment (for the part of the string that was left over from the exact match) with affine gap penalties as the distance metric. Yes, my problem doesn't depend on context, just the words themselves. Also, the strings are in multiple languages, not all of which are supported by the usual NLP packages. As I have a mapping of words to classes, I thought this might be a good option. – user281989 Jun 25 '19 at 19:58
  • But an edit distance search also sounds like a good first step. Could you point me to such papers/implementations? – user281989 Jun 25 '19 at 20:22