A binary string, as defined here, is a fixed-size "array" of bits. I call them strings because there is no order on them (sorting or indexing them as numbers has no meaning); each bit is independent of the others. Each such string is N bits long, with N in the hundreds.

I need to store these strings and, given a new binary string, query for its nearest neighbor using the Hamming distance as the distance metric.
There are specialized data structures (metric trees) for metric-based search, such as VP-trees, cover trees, and M-trees, but I need to use a regular database (MongoDB in my case).

Is there some indexing function that can be applied to the binary strings so that the DB only needs to access a subset of the records before performing the one-to-one Hamming distance match? Alternatively, how could such a Hamming-based search be implemented on a standard DB?
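For concreteness, if each string is held as an (arbitrary-precision) integer, the Hamming distance is just the popcount of an XOR; a minimal Python sketch (the names are illustrative, not part of any existing code):

def hamming(a: int, b: int) -> int:
    # Number of bit positions in which a and b differ: popcount of the XOR.
    return bin(a ^ b).count("1")

# 0b1011 and 0b0010 differ in bits 0 and 3, so their distance is 2.
assert hamming(0b1011, 0b0010) == 2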

Adi Shavit
  • "I call them strings because there is no order on them" - strings have order - lexicographical, specifically. – Nick Johnson Jul 07 '11 at 00:50
  • Of course, yet usually sequences of bits are referred to as "numbers", or integers to be exact, which do have a natural order. – Adi Shavit Jul 07 '11 at 07:19

2 Answers

The Hamming distance is a metric, so it satisfies the triangle inequality. For each bitstring in your database, you could store its Hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

So let's say

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

So for each B, you would store its corresponding distB.

A lower bound for hamming(B,S) would then be abs(distB-distS). And the upper bound would be distB+distS.

You can discard all B such that the lower bound is higher than the lowest upper bound.

I'm not 100% sure of the optimal way to choose C. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.
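A minimal Python/pymongo sketch of this scheme, under a few assumptions: the raw bits are stored as a hex string in a field called bits (BSON integers are only 64-bit, so a value hundreds of bits long won't fit as a number), the precomputed distance to C sits in an indexed field dist_to_c, and the collection and field names are made up for illustration.

from pymongo import MongoClient

def hamming(a: int, b: int) -> int:
    # Hamming distance between two bitstrings held as Python ints.
    return bin(a ^ b).count("1")

client = MongoClient()
coll = client["demo"]["bitstrings"]
coll.create_index("dist_to_c")          # the range query below relies on this

def insert_bitstring(coll, C: int, bits: int) -> None:
    # Store the bits as hex plus the precomputed distance to the constant C.
    coll.insert_one({"bits": format(bits, "x"),
                     "dist_to_c": hamming(bits, C)})

def nearest(coll, C: int, S: int):
    dist_s = hamming(S, C)
    # Seed with the document whose stored dist_to_c is closest to dist_s from
    # above (or the largest one if none); that gives an initial upper bound.
    seed = (coll.find_one({"dist_to_c": {"$gte": dist_s}}, sort=[("dist_to_c", 1)])
            or coll.find_one(sort=[("dist_to_c", -1)]))
    if seed is None:
        return None, None
    best_doc, best = seed, hamming(int(seed["bits"], 16), S)
    # Triangle inequality: hamming(B, S) >= |dist_to_c(B) - dist_s|, so only
    # documents with dist_to_c in [dist_s - best, dist_s + best] can improve.
    for doc in coll.find({"dist_to_c": {"$gte": dist_s - best,
                                        "$lte": dist_s + best}}):
        d = hamming(int(doc["bits"], 16), S)
        if d < best:
            best, best_doc = d, doc
    return best_doc, best

With the index on dist_to_c, the range query only scans documents whose stored distance falls inside the bound; how much that prunes depends on how spread out the distances to C are, which is where the choice of C above matters.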

tskuzzy
    This is equivalent to storing a Bk-tree with only the root node. It's going to help a bit, but it's still going to leave a lot of search space to examine. You could store the distance to multiple strings (ala a FHFQ bk-tree), but that'd require a query with multiple inequalities, which can't be evaluated efficiently on a 1-dimensional index. – Nick Johnson Jul 07 '11 at 00:52
  • @tskuzzy: Thanks, just what I wanted. Just to clarify the reply: Let minDistB be the smallest distB of Bs in the database, then discard all B such that: abs(distB-distS) > distS + minDistB. – Adi Shavit Jul 07 '11 at 08:50
  • @Nick Johnson: You're right (thanks for the BK-Tree post link). Indeed, the obvious extension would be to index against several such Cs, spread around the space and have the database index several Bs, one per C. – Adi Shavit Jul 07 '11 at 08:56
  • @Adi Right. There's no way to satisfy a query with multiple inequalities efficiently against a 1-dimensional index, though, so the savings wouldn't be as large as using an actual tree-structure to store the data. – Nick Johnson Jul 08 '11 at 00:12
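Building on the comments above, a rough sketch of the multi-constant extension: store the distance to each of several reference bitstrings C1..Ck, fetch on one of them (a 1-D index can only serve one inequality, as Nick notes), and apply the combined lower bound client-side. The sketch below only shows the pruning test over records already fetched; the names and layout are illustrative.

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def prune(records, pivots, S, radius):
    # records: (bits, [dist to C1, ..., dist to Ck]) pairs, with the distances
    # precomputed at insert time. Keep only records whose largest
    # triangle-inequality lower bound does not exceed `radius`.
    dist_s = [hamming(S, c) for c in pivots]
    kept = []
    for bits, dists in records:
        lower = max(abs(db - ds) for db, ds in zip(dists, dist_s))
        if lower <= radius:
            kept.append(bits)
    return kept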

You could use an approach similar to Levenshtein automata, applied to bitstrings:

  1. Compute the first (lexicographically smallest) bitstring that would meet your Hamming-distance criteria.
  2. Fetch the first result from the database that's greater than or equal to that value.
  3. If the value is a match, output it and fetch the next result. Otherwise, compute the next value greater than it that is a match, and go back to step 2.

This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.
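A rough Python sketch of that loop, with a sorted in-memory list standing in for the database index. next_match is a deliberately naive stand-in (it steps forward one value at a time) for the real successor computation, i.e. "smallest value >= x within Hamming distance k of the query", which is the part the Levenshtein-automaton-style trick makes efficient; everything here is illustrative, not an existing API.

from bisect import bisect_left

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def next_match(x, query, k, n_bits):
    # Smallest value >= x within Hamming distance k of query.
    # Naive stand-in: step forward one value at a time (fine for tiny n_bits;
    # the real approach computes this successor directly).
    while x < (1 << n_bits) and hamming(x, query) > k:
        x += 1
    return x if x < (1 << n_bits) else None

def search(sorted_values, query, k, n_bits):
    # Merge-join between the sorted index and the set of values within
    # Hamming distance k of `query`, without enumerating that whole set.
    results, cur = [], 0
    while True:
        cur = next_match(cur, query, k, n_bits)     # steps 1 and 3
        if cur is None:
            break
        i = bisect_left(sorted_values, cur)         # step 2: first value >= cur
        if i == len(sorted_values):
            break
        val = sorted_values[i]
        if hamming(val, query) <= k:
            results.append(val)                     # match: output it
            cur = val + 1                           # fetch the next result
        else:
            cur = val                               # compute the next match >= val
    return results

# Tiny example: 6-bit values, everything within distance 1 of 0b101010.
db = sorted([0b000000, 0b101010, 0b101011, 0b111111])
print(search(db, 0b101010, 1, 6))                   # -> [42, 43]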

Nick Johnson