
I need a locality preserving hash function implementation for C# (or possibly an alternative solution). I would like to figure out a way to map strings (i.e. similar gene sequence tokens, sometimes of slightly different lengths) into the same buckets using a similarity threshold. For instance, if two gene sequence tokens have a Levenshtein edit distance below a specified threshold of 5, 10, 25, etc., I would like to assign them to the same bucket / category. However, I cannot use edit distance directly, since the token categories are not known in advance and the calculation carries too much overhead. I need a very efficient locality preserving hash function (or alternative solution) which will allow me to determine the bucket closest to the hash value based on the threshold, or create a new bucket when a close enough bucket does not exist. So far, I have not been able to find a single locality preserving hashing function implementation in C#, only publications. I figured I would ask before attempting to write my own.
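For what it is worth, a minimal sketch of the bucket-assignment behaviour described above might look like the following, assuming some locality preserving hash is available as a `Func<string, double>`; the `ThresholdBucketer` class and its members are hypothetical names for illustration, not an existing library:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of the bucketing behaviour described in the question: a token is
// assigned to the nearest existing bucket whose centre is within the
// threshold, otherwise a new bucket is opened. The hash function itself is
// left as a parameter because that is the open part of the question.
class ThresholdBucketer
{
    private readonly Func<string, double> _hash;
    private readonly double _threshold;
    private readonly Dictionary<double, List<string>> _buckets =
        new Dictionary<double, List<string>>();

    public ThresholdBucketer(Func<string, double> hash, double threshold)
    {
        _hash = hash;
        _threshold = threshold;
    }

    public double Assign(string token)
    {
        double h = _hash(token);

        // Find the closest existing bucket centre, if any.
        if (_buckets.Count > 0)
        {
            double nearest = _buckets.Keys.OrderBy(c => Math.Abs(c - h)).First();
            if (Math.Abs(nearest - h) <= _threshold)
            {
                _buckets[nearest].Add(token);
                return nearest;
            }
        }

        // No bucket is close enough: open a new one centred on this hash value.
        _buckets[h] = new List<string> { token };
        return h;
    }
}
```

A call such as `bucketer.Assign("TAGGC")` then either returns an existing centre within the threshold or the token's own hash value as a new centre. The linear scan over bucket centres is only for clarity; a sorted structure would be the obvious replacement once the number of buckets grows.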

Jake Drew
  • I know so little about your problem that my comment probably doesn't rise to the level of "dumb", but I'm going to throw this out anyway. I'm assuming that your input data has a limited character space (i.e., only "ABCDEF"). If you create a point in x-dimensional space, where x is the number of characters in the character space, by counting the number of occurrences of each character, you can then use the distance between points to estimate the likelihood of similarity. Filter the points using a minimum distance threshold to determine pairs that are worth a Levenshtein distance calculation. – William Oct 20 '13 at 06:09
  • Gene sequences typically contain 4 characters (T, A, G, or C). If I could figure out a way to turn this 4-dimensional "point" into a numeric value, this might work. I need to convert the gene token to a number and know what bucket the gene token should be placed in based upon the number, i.e. if the calculated "point" is 10,990, I would just place this value in the closest bucket based on a predetermined sensitivity. If the buckets were separated by 100s, 10,990 would be placed in the 11,000 bucket with no edit distance calculations being performed against any existing buckets. – Jake Drew Oct 20 '13 at 18:37
  • The most important point being that the resulting number must maintain the original sort order of the gene sequence token inputs (or come pretty close). This is so that very similar gene sequence tokens get mapped into the same buckets with no distance calculations required. – Jake Drew Oct 20 '13 at 19:46
  • No, my suggestion would not maintain the sort order, unless your order happened to be based on the occurrence count of each character. With the count based point, you would only end up with a measure of possible similarity. – William Oct 21 '13 at 03:14
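To make the suggestion in this comment thread concrete, here is a small sketch that projects a token onto its A/C/G/T occurrence counts and uses the distance between count points as a cheap pre-filter before any Levenshtein calculation; the `CountVector` name and the exact-ACGT alphabet are assumptions for illustration:

```csharp
using System;
using System.Linq;

static class CountVector
{
    // Assumed alphabet: gene tokens contain only A, C, G and T.
    private const string Alphabet = "ACGT";

    // Project a token onto a 4-dimensional point of per-character counts.
    public static int[] ToPoint(string token)
    {
        return Alphabet.Select(c => token.Count(t => t == c)).ToArray();
    }

    // Euclidean distance between two count points.
    public static double Distance(int[] a, int[] b)
    {
        return Math.Sqrt(a.Zip(b, (x, y) => (double)(x - y) * (x - y)).Sum());
    }

    // Only token pairs whose points are close are worth the full
    // edit-distance computation.
    public static bool WorthComparing(string s, string t, double maxDistance)
    {
        return Distance(ToPoint(s), ToPoint(t)) <= maxDistance;
    }
}
```

The follow-up idea of snapping a scalar to the nearest bucket (e.g. placing 10,990 in the 11,000 bucket when buckets are 100 apart) would amount to `Math.Round(value / bucketWidth) * bucketWidth`, but as William notes, a count-based value like this only gives a rough measure of similarity and does not preserve the original sort order of the tokens.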

1 Answer


Some phonetic algorithm (e.g. http://en.wikipedia.org/wiki/Soundex) could help.

It basically converts the word to an array of characters that describes its pronunciation. It can be used for searching for similar words. It is also important to note that such algorithms are language specific (human language, not programming language).
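As an illustration only, a simplified American Soundex along the lines of the Wikipedia description above could be written like this in C# (purely alphabetic input is assumed, and the special H/W separator rules are omitted):

```csharp
using System.Text;

static class Phonetic
{
    // Simplified American Soundex: keep the first letter, replace later
    // consonants with digit codes, skip repeated codes and vowels, and pad
    // the result to four characters. Assumes the input is alphabetic.
    public static string Soundex(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        // Digit code for each letter A..Z; '0' marks vowels and ignored letters.
        const string codes = "01230120022455012623010202";
        string upper = word.ToUpperInvariant();

        var result = new StringBuilder();
        result.Append(upper[0]);

        char previous = codes[upper[0] - 'A'];
        for (int i = 1; i < upper.Length && result.Length < 4; i++)
        {
            if (upper[i] < 'A' || upper[i] > 'Z') continue;
            char code = codes[upper[i] - 'A'];
            if (code != '0' && code != previous) result.Append(code);
            previous = code;
        }

        // Pad with zeros so the code is always four characters long.
        return result.Append('0', 4 - result.Length).ToString();
    }
}
```

For example, `Phonetic.Soundex("Robert")` yields "R163", while gene-style inputs collapse quickly ("AAAA" becomes "A000"), which is exactly the limitation raised in the comment below.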

Ondra
  • I had briefly considered this, but Soundex() does not appear to work well for a limited character set like gene sequences. For instance, AAAA = A000, AAAT = A300, TAAA = T000, yet all three are only separated by 1 character. – Jake Drew Oct 20 '13 at 18:20