I am trying to compute Hamming distances between every pair of nodes in a graph of n nodes. Each node has a label of the same length k, and the label alphabet is {0, 1, *}, where '*' acts as a don't-care symbol: a position is ignored if either label has '*' there. For example, the Hamming distance between the labels 101*01 and 1001*1 is 1 (they differ only at the 3rd position, counting from 1).
What I need to do is find, for each node, all neighbors at Hamming distance 1 and report the exact index at which the two labels differ.
I am comparing each node's label with all the others, character by character, as follows:
// Given two strings s1, s2 of the same length k,
// returns the (0-based) index of the change if hd(s1,s2)=1, -1 otherwise.
// (The method name is only illustrative.)
static int diffIndex(String s1, String s2)
{
    int k = s1.length();
    int count = 0;
    int index = -1;
    for (int i = 0; i < k; i++)
    {
        // do not compute anything for *
        char c1 = s1.charAt(i);
        if (c1 == '*')
            continue;
        char c2 = s2.charAt(i);
        if (c2 == '*')
            continue;
        if (c1 != c2)
        {
            index = i;
            count++;
            // if hamming distance is greater than 1, immediately stop
            if (count > 1)
            {
                index = -1;
                break;
            }
        }
    }
    return index;
}
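One constant-factor variant I have been sketching, assuming k ≤ 64 so that a label fits into a single long: encode each label once as two bitmasks (which positions are not '*', and which positions hold '1'), then compare two labels with a few bitwise operations instead of a character loop. The class and method names below (MaskedHamming, careMask, valueBits, maskedDiffIndex) are my own and purely illustrative, and this is only a minimal sketch, not a tested solution:

```java
final class MaskedHamming {
    // Bit i is set when position i holds '0' or '1' (i.e. is not '*').
    static long careMask(String s) {
        long mask = 0L;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != '*') {
                mask |= 1L << i;
            }
        }
        return mask;
    }

    // Bit i is set when position i holds '1'.
    static long valueBits(String s) {
        long bits = 0L;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '1') {
                bits |= 1L << i;
            }
        }
        return bits;
    }

    // Returns the 0-based index of the single differing position,
    // or -1 if the masked Hamming distance is not exactly 1.
    // Assumption: both labels have the same length k <= 64.
    static int maskedDiffIndex(long bits1, long care1, long bits2, long care2) {
        // A position counts only if neither label has '*' there.
        long diff = (bits1 ^ bits2) & care1 & care2;
        return Long.bitCount(diff) == 1 ? Long.numberOfTrailingZeros(diff) : -1;
    }
}
```

With the masks precomputed once per node, each pair costs a handful of machine instructions. This only removes the factor of k, though; the n*n loop over pairs is still there, which is the part I do not know how to avoid.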
I may have a couple of million nodes, and k is usually around 50. I am using Java; this all-pairs comparison takes O(n*n*k) time and is too slow. I considered tries and VP-trees but could not figure out which data structure fits this case, especially with the '*' symbol. I also looked at the SimMetrics library, but nothing there seemed to apply. I would really appreciate any suggestions.