4

I am trying to compute Hamming distances between every pair of nodes in a graph of n nodes. Each node has a label of the same length (k), and the alphabet used for labels is {0, 1, *}. The '*' acts as a don't-care symbol. For example, the Hamming distance between labels 101*01 and 1001*1 is 1 (they differ only at the 3rd position, i.e. index 2 counting from 0).

What I need to do is to find all 1-hamming-distance neighbors of each node and report exactly at which index those two labels differ.

I am comparing each node's label with all the others, character by character, as follows:

    // Given two strings s1, s2 of equal length,
    // returns the index of the change if hd(s1,s2)=1, -1 otherwise.
    static int differingIndex(String s1, String s2)
    {
        int k = s1.length();
        int count = 0;
        char c1, c2;
        int index = -1;

        for (int i = 0; i < k; i++)
        {
            // do not compute anything for *
            c1 = s1.charAt(i);
            if (c1 == '*')
                continue;

            c2 = s2.charAt(i);
            if (c2 == '*')
                continue;

            if (c1 != c2)
            {
                index = i;
                count++;

                // if hamming distance is greater than 1, immediately stop
                if (count > 1)
                {
                    index = -1;
                    break;
                }
            }
        }
        return index;
    }

I may have a couple of million nodes, and k is usually around 50. I am using Java; this comparison takes O(n*n*k) time and is slow. I considered using tries and VP-trees but could not figure out which data structure fits this case. I also looked at the Simmetrics library, but nothing came to mind. I would really appreciate any suggestions.

Kara
begumgenc

3 Answers

1

Try this approach:

Convert the keys into ternary (base-3) numbers, i.e. 0=0, 1=1, *=2. Ten ternary digits give you a range of 0..59048 (3^10 = 59049 values), which fits in 16 bits.

That means two of those would form a 32-bit word. Create a lookup table with 4 billion entries that returns the distance between those two 10-digit ternary words.

You can now use the lookup table to check 10 characters of the key with one lookup. If you use 5 characters, then 3^5 gives you 243 values which would fit into one byte, so the lookup table would only be 64 KB.

By using shift operations, you can create lookup tables of different sizes to balance memory and speed.

That way, you can optimize the loop to abort much more quickly.

To get the position of the first difference, you can use a second lookup table which contains the index of the first difference for two key substrings.
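
As a rough illustration of the two lookup tables for 5-character chunks, here is a minimal sketch. The class and method names (`TernaryLookup`, `encode`, `differingIndex`) are made up for this example, it assumes both labels have the same length, and it indexes the tables by `a * 243 + b` (about 58 KB per table) instead of packing two bytes into a 16-bit index.

```java
final class TernaryLookup {
    static final int CHUNK = 5;    // characters per chunk
    static final int CODES = 243;  // 3^5 possible chunk values

    // DIST[a * CODES + b]: number of differing positions between chunks a and b
    static final byte[] DIST = new byte[CODES * CODES];
    // FIRST[a * CODES + b]: offset of the first difference inside the chunk, or -1
    static final byte[] FIRST = new byte[CODES * CODES];

    static {
        for (int a = 0; a < CODES; a++) {
            for (int b = 0; b < CODES; b++) {
                int x = a, y = b, dist = 0, first = -1;
                for (int i = 0; i < CHUNK; i++) {
                    int dx = x % 3, dy = y % 3;  // ternary digit i of each chunk
                    // digit 2 is '*', which never counts as a difference
                    if (dx != 2 && dy != 2 && dx != dy) {
                        if (first < 0) first = i;
                        dist++;
                    }
                    x /= 3;
                    y /= 3;
                }
                DIST[a * CODES + b] = (byte) dist;
                FIRST[a * CODES + b] = (byte) first;
            }
        }
    }

    // Converts a label into one base-3 code per chunk; character i of a chunk
    // has weight 3^i, and a short last chunk is implicitly padded with 0 digits.
    static int[] encode(String label) {
        int[] codes = new int[(label.length() + CHUNK - 1) / CHUNK];
        for (int i = label.length() - 1; i >= 0; i--) {
            char c = label.charAt(i);
            int digit = (c == '*') ? 2 : c - '0';
            codes[i / CHUNK] = codes[i / CHUNK] * 3 + digit;
        }
        return codes;
    }

    // Returns the index of the single difference, or -1 if the distance is not 1.
    static int differingIndex(int[] a, int[] b) {
        int index = -1;
        for (int c = 0; c < a.length; c++) {
            int key = a[c] * CODES + b[c];
            int d = DIST[key];
            if (d == 0) continue;
            if (d > 1 || index >= 0) return -1;  // second difference: give up early
            index = c * CHUNK + FIRST[key];
        }
        return index;
    }
}
```

Each label would be encoded once up front, so the pairwise loop only does one table lookup per 5 characters. For the labels from the question, `TernaryLookup.differingIndex(TernaryLookup.encode("101*01"), TernaryLookup.encode("1001*1"))` gives index 2.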

If you have millions of nodes, then you will have many that start with the same substring. Try to sort them into buckets where one bucket contains nodes that start with the same key. The goal here is to make the buckets as small as possible (to reduce the n*n).

Aaron Digulla
  • Thanks for the answer. I am thinking about how to use the lookup tables. It has advantages only if I precompute it. Because I only need the distance once. However, the string size (k) is not a fixed number. For my tests it has a value between 4 and 75. I do not know how to decide on the size. Additionally, bucket sort also has a worst case performance of n*n. Am I missing a point? – begumgenc Apr 10 '15 at 10:53
  • You can pad shorter keys with `0` at the end; that keeps the distance and the index of the first change the same. For longer keys, you need to split them and do several lookups. – Aaron Digulla Apr 10 '15 at 11:00
  • For the bucket sort, you can use a hash map where the key is the length of the node key (if the lengths of two node keys are different by more than 1 digit, then they must have a hamming distance of over 2). In those buckets, you can then sort/hash by the first M digits of the key. That should give you O(N) for the bucket preparation. – Aaron Digulla Apr 10 '15 at 11:02
  • I tried using a lookup table, but due to memory restrictions I was only able to create one for 5-digit labels. Unfortunately, this seems to be 2-3 times slower than the previous approach (the time for creating the lookup table is not included). – begumgenc Apr 11 '15 at 18:47
  • Try to convert the node keys only once. Run a profiler to see where the time is spent. – Aaron Digulla Apr 14 '15 at 08:13
1

Instead of, or in addition to, the string, store a mask for the 1 bits and a mask for the * bits. One could use BitSet, but let's try without.

    static int mask(String value, char digit) {
        int mask = 0;
        int bit = 2; // Start with bits[1] as per specification.
        for (int i = 0; i < value.length(); ++i) {
            if (value.charAt(i) == digit) {
                mask |= bit;
            }
            bit <<= 1;
        }
        return mask;
    }

    class Cell {
        int ones;  // mask(label, '1')
        int stars; // mask(label, '*')
    }

    // Bits set where both labels have a non-* character and those characters differ.
    int difference(Cell x, Cell y) {
        return (x.ones & ~y.stars) ^ (y.ones & ~x.stars);
    }

    int hammingDistance(Cell x, Cell y) {
        return Integer.bitCount(difference(x, y));
    }

    // True when exactly one bit is set in the difference.
    boolean differsBy1(Cell x, Cell y) {
        int diff = difference(x, y);
        return diff != 0 && (diff & (diff - 1)) == 0;
    }

    // 1-based position of the lowest differing character, because mask() starts at bit 1.
    int bitPosition(int diff) {
        return Integer.numberOfTrailingZeros(diff);
    }
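
Since k is usually around 50 in the question, the same idea can be sketched with `long` masks. This is only an illustration under the assumptions that k ≤ 64 and that character i maps to bit i; the class name `MaskedLabel` and the method `differingIndex` are made up for this example.

```java
final class MaskedLabel {
    final long ones;   // bit i is set when character i is '1'
    final long stars;  // bit i is set when character i is '*'

    MaskedLabel(String label) {
        if (label.length() > 64) {
            throw new IllegalArgumentException("label longer than 64 characters");
        }
        long o = 0, s = 0;
        for (int i = 0; i < label.length(); i++) {
            char c = label.charAt(i);
            if (c == '1') {
                o |= 1L << i;
            } else if (c == '*') {
                s |= 1L << i;
            }
        }
        ones = o;
        stars = s;
    }

    // Returns the 0-based index of the single difference, or -1 if the
    // Hamming distance (ignoring '*' positions) is not exactly 1.
    static int differingIndex(MaskedLabel x, MaskedLabel y) {
        long care = ~(x.stars | y.stars);       // positions where neither label has '*'
        long diff = (x.ones ^ y.ones) & care;   // positions that really differ
        if (diff == 0 || (diff & (diff - 1)) != 0) {
            return -1;                          // zero bits set, or more than one
        }
        return Long.numberOfTrailingZeros(diff);
    }
}
```

With the labels from the question, `MaskedLabel.differingIndex(new MaskedLabel("101*01"), new MaskedLabel("1001*1"))` gives 2. The masks are built once per node, so each pairwise comparison is a handful of bitwise operations instead of a loop over k characters.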
Joop Eggen
  • Definitely agree with this (bit-mask) approach, but two minor points: (1) I see no "specification [to] Start with bits[1]" and I think position [0] in the String maps naturally to 1<<(0) with value 1, [1] to 1<<(1) with value 2, etc. and (2) question says k is "usually around 50" which doesn't fit in Java `int` (32 bits) but does fit in `long` (64 bits). If k is 65..128 use two `long` and `differsBy1` becomes slightly more complicated `hidiff==0 && lodiff!=0 && (lodiff&(lodiff-1))==0 || hidiff!=0 && (hidiff&(hidiff-1))==0 && lodiff==0`. – dave_thompson_085 Jul 18 '15 at 10:07
  • @dave_thompson_085 thanks for putting that much thought into it. – Joop Eggen Jul 18 '15 at 13:04
0

Interesting problem. It would be easy if it weren't for the wildcard symbol.

If the wildcard were a regular character in the alphabet, then for a given string you could enumerate all k strings at Hamming distance 1 and look them up in a multi-map. So for example for 101 you look up 001, 111 and 100.

The don't-care symbol makes it so that you can't do that lookup directly. However, if the multi-map is built such that each node is stored under all its possible keys, you can do that lookup again. So for example 1*1 is stored as 111 and 101. When you do the lookup for 10*, you flip each non-wildcard position in turn and look up 000, 001, 110 and 111, which would find 1*1 because it was stored under 111.
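
Here is a minimal sketch of that multi-map idea, using a plain `HashMap` with string keys for readability. The class `WildcardIndex` and its methods are hypothetical names for this example. One assumption goes beyond the description above: a hit is treated only as a candidate, and it is rejected when the stored label has a `*` at the flipped position, because that pair actually has distance 0 rather than 1.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WildcardIndex {
    private final List<String> labels = new ArrayList<>();
    private final Map<String, List<Integer>> byExpansion = new HashMap<>();

    // Stores the label under every concrete 0/1 string it can stand for
    // (2^z keys for a label with z wildcards).
    void add(String label) {
        int id = labels.size();
        labels.add(label);
        for (String key : expand(label)) {
            byExpansion.computeIfAbsent(key, k -> new ArrayList<>()).add(id);
        }
    }

    // Returns a map from neighbour id to the index where the two labels differ.
    Map<Integer, Integer> neighbours(String query) {
        Map<Integer, Integer> result = new TreeMap<>();
        char[] chars = query.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            if (chars[i] == '*') continue;  // a wildcard position can never differ
            char original = chars[i];
            chars[i] = (original == '0') ? '1' : '0';
            for (String key : expand(new String(chars))) {
                for (int id : byExpansion.getOrDefault(key, Collections.emptyList())) {
                    // A stored '*' at position i matches anything, so skip that hit.
                    if (labels.get(id).charAt(i) != '*') result.put(id, i);
                }
            }
            chars[i] = original;
        }
        return result;
    }

    // Enumerates all concrete strings obtained by replacing each '*' with 0 or 1.
    static List<String> expand(String label) {
        List<String> out = new ArrayList<>();
        expand(label.toCharArray(), 0, out);
        return out;
    }

    private static void expand(char[] chars, int from, List<String> out) {
        int star = from;
        while (star < chars.length && chars[star] != '*') star++;
        if (star == chars.length) {
            out.add(new String(chars));
            return;
        }
        chars[star] = '0';
        expand(chars, star + 1, out);
        chars[star] = '1';
        expand(chars, star + 1, out);
        chars[star] = '*';  // restore the wildcard for the caller
    }
}
```

For instance, after adding 1*1, 000 and 110 (ids 0, 1 and 2), `neighbours("10*")` reports id 1 at index 0 and id 2 at index 1. The lookup of 111 does hit 1*1, but that hit is dropped because 10* and 1*1 do not differ at any position where both have a real character.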

The upside of this is also that you can store all labels as integers rather than ternary structures, so with an int[3] as the key value you can use any k < 96.

Performance would depend on the backing implementation of the multi-map. Ideally you'd use a hash implementation for key sizes < 32 and a tree implementation for anything above. With the tree implementation, all nodes can be connected to their distance-1 neighbors in O(n*k*log(n)). Building the multi-map takes O(n * 2^z), where z is the maximum number of wildcard characters in any string. If the average number of wildcards is low, this should be an acceptable performance penalty.

edit: You can improve lookup performance for all nodes to O(n*log(n)) by also inserting the Hamming distance 1 neighbors into the multi-map, but that might just explode its size.

Note: I'm typing this in a lunch break. I haven't checked the details yet.

M.P. Korstanje