
If I represent DNA bases as binary values, what is the best way to compute the distance between two sequences?

So: A = 00, T = 11, G = 01, and C = 10

The Hamming distance between ATGC and TAAC is 3, but their binary representations give a different answer:

Hamming Distance of 00110110 and 11000010 = 5.
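
To make the mismatch concrete, here is a minimal Java sketch that packs each sequence into an int using the 2-bit code above and counts differing bits directly. The class and method names (NaiveBitDistance, encode, bitHamming) are made up for illustration, and an int only holds up to 16 bases this way.

```java
final class NaiveBitDistance {
    // Pack a DNA string into an int, 2 bits per base: A=00, G=01, C=10, T=11.
    // Assumes at most 16 bases, since an int has 32 bits.
    static int encode(String dna) {
        int packed = 0;
        for (char base : dna.toCharArray()) {
            int code;
            switch (base) {
                case 'A': code = 0b00; break;
                case 'G': code = 0b01; break;
                case 'C': code = 0b10; break;
                case 'T': code = 0b11; break;
                default: throw new IllegalArgumentException("Unexpected base: " + base);
            }
            packed = (packed << 2) | code;
        }
        return packed;
    }

    // Plain Hamming distance on the packed bits (counts bits, not bases).
    static int bitHamming(int a, int b) {
        return Integer.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        int atgc = encode("ATGC"); // 00110110 = 54
        int taac = encode("TAAC"); // 11000010 = 194
        System.out.println(bitHamming(atgc, taac)); // prints 5, not 3
    }
}
```

The count is 5 because A vs T and T vs A each differ in both of their bits, so those two mismatches contribute four bits instead of two.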

What's the best way to compute the distance if the DNA bases are represented this way?

Maljam
Y.V

1 Answer


You could use binary operations to do something like this (in Java, but you can apply the logic in any language):

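int seq1 = 54, seq2 = 194; // ATGC = 00110110, TAAC = 11000010
int evenBit = 0xAAAAAAAA, oddBit = 0x55555555; // upper / lower bit of each 2-bit pair

int pseudoDist = seq1 ^ seq2; // bits that differ
// >>> (not >>) so the sign bit is not copied in when bit 31 is set
int dist = (pseudoDist & evenBit) >>> 1;
dist |= pseudoDist & oddBit; // one bit per mismatching nucleotide
int finalDist = Integer.bitCount(dist); // output 3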

The idea is to get the total number of bits that are different with: seq1 ^ seq2

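But you can't just count the bits yet, because that would give you the bit-level Hamming distance (5 here) rather than the number of mismatching nucleotides. You first have to collapse the two bits that belong to the same nucleotide into a single bit, using (pseudoDist & 0xAAAAAAAA) >>> 1 and pseudoDist & 0x55555555. The first keeps the upper bit of each 2-bit pair and shifts it down onto the lower position; the second keeps the lower bit of each pair. Note the parentheses around the mask (in Java, shifts bind tighter than &) and the unsigned shift >>>, which avoids copying the sign bit when all 32 bits are in use.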

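Now you OR the two results together, so each nucleotide is left with a single bit that is set if either of its two bits differed, and you can count the set bits.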
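
The same steps can be wrapped into a reusable method. The following is only a sketch under a few assumptions: it uses a long so that up to 32 bases fit, it assumes both sequences have the same length, and the names (NucleotideDistance, encode, distance, HIGH_BITS, LOW_BITS) are made up for illustration.

```java
final class NucleotideDistance {
    // Masks selecting the upper and lower bit of every 2-bit pair.
    private static final long HIGH_BITS = 0xAAAAAAAAAAAAAAAAL;
    private static final long LOW_BITS  = 0x5555555555555555L;

    // Pack a DNA string into a long, 2 bits per base: A=00, G=01, C=10, T=11.
    static long encode(String dna) {
        if (dna.length() > 32) {
            throw new IllegalArgumentException("At most 32 bases fit in a long");
        }
        long packed = 0L;
        for (char base : dna.toCharArray()) {
            long code;
            switch (base) {
                case 'A': code = 0b00; break;
                case 'G': code = 0b01; break;
                case 'C': code = 0b10; break;
                case 'T': code = 0b11; break;
                default: throw new IllegalArgumentException("Unexpected base: " + base);
            }
            packed = (packed << 2) | code;
        }
        return packed;
    }

    // Number of positions where the two packed sequences hold different bases.
    static int distance(long a, long b) {
        long diff = a ^ b; // bits that differ
        // Fold each pair's upper bit onto its lower bit; >>> keeps the sign bit out.
        long collapsed = ((diff & HIGH_BITS) >>> 1) | (diff & LOW_BITS);
        return Long.bitCount(collapsed);
    }

    public static void main(String[] args) {
        System.out.println(distance(encode("ATGC"), encode("TAAC"))); // prints 3
    }
}
```

The unsigned shift matters once the top bit is in use: with 32 bases of T against 32 bases of A, a signed >> would leave an extra bit set and report 33 instead of 32.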

Maljam