
I am trying to use the Viterbi min-sum algorithm, which finds the path through a set of nodes that minimizes the overall Hamming distance (a fancy term for "xor two numbers and count the set bits in the result") against some fixed input.
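For concreteness, the distance described above can be sketched in Python as follows. The function name `hamming` is only illustrative and does not come from any particular library.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    # XOR leaves a 1 exactly where the two values disagree; count those 1s.
    return bin(a ^ b).count("1")


print(hamming(0b10, 0b00))  # the chunks 10 and 00 differ in one bit
print(hamming(0b01, 0b10))  # the chunks 01 and 10 differ in both bits
```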

I understand how to use DP to compute the minimal overall distance, but I am having trouble also capturing the path that achieves that minimal distance.

It seems like memoizing the path at each node would be really memory-intensive. Is there a standard way to handle these kinds of problems?

Edit:

https://i.stack.imgur.com/wsLfg.jpg

Here is a sample trellis showing what I am talking about. The general idea is to find the path through the trellis that most closely matches the input bitstring, with minimal error (measured by overall Hamming distance, i.e. the number of mismatched bits).

As you can see, the first chunk of my input string is 01, and I can traverse there in column 1 of the trellis. The next chunk is 10, and I can move there in column 2. Next chunk is 11. Fine so far. Next chunk is 10, which is a problem because I can't reach that state from where I am now, so I have to go to the next best thing (00) and the rest can be filled fine.

But this can become more complex. I'd need to be able to somehow get the path that corresponds to the minimal Hamming distance.

(The point of this exercise is that the trellis represents what are ACTUALLY valid transitions, whereas the input string is something you receive through telecommunication and might get garbled and have incorrect bits here and there. This program tries to figure out what the input string SHOULD be by minimizing error.)

2 Answers


There's the usual "follow path backwards" technique, requiring only the table of values (but the whole table of values, no cheating with "keep only the most recent part"). The algorithm is simple: start at the end, decide which way you came from. You can make that decision, because either there's exactly one way such that if you came from it you'd compute the value that matches the stored one, or several result in the same value and it wouldn't matter which one you chose.
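A minimal sketch of that backwards walk, under assumptions that are not in the question: the trellis is the same at every step, it is given as a hypothetical dict `edges` mapping `(previous_state, next_state)` to the symbol emitted on that transition, the cost of a transition is the Hamming distance between that symbol and the received chunk, and the path starts in state 0. The names `forward` and `backtrace` are illustrative only.

```python
INF = float("inf")


def hamming(a, b):
    return bin(a ^ b).count("1")


def forward(edges, n_states, received, start=0):
    """cost[t][s] = minimal total distance of any path that is in state s after t chunks."""
    cost = [[INF] * n_states for _ in range(len(received) + 1)]
    cost[0][start] = 0
    for t, chunk in enumerate(received):
        for (p, s), out in edges.items():
            c = cost[t][p] + hamming(out, chunk)
            if c < cost[t + 1][s]:
                cost[t + 1][s] = c
    return cost


def backtrace(edges, cost, received):
    """Recover one optimal state sequence from the full cost table alone."""
    T = len(received)
    # Start at the cheapest final state.
    s = min(range(len(cost[T])), key=cost[T].__getitem__)
    path = [s]
    for t in range(T, 0, -1):
        # Find a predecessor that reproduces the stored value; ties do not matter.
        for (p, q), out in edges.items():
            if q == s and cost[t - 1][p] + hamming(out, received[t - 1]) == cost[t][s]:
                s = p
                break
        path.append(s)
    path.reverse()
    return path


# Toy two-state trellis (made up for illustration).
edges = {(0, 0): 0b00, (0, 1): 0b11, (1, 0): 0b10, (1, 1): 0b01}
received = [0b11, 0b00, 0b10]  # the middle chunk has one garbled bit
cost = forward(edges, 2, received)
print(min(cost[-1]), backtrace(edges, cost, received))
```

Note that `backtrace` reads every column of `cost`, which is why the whole table has to be kept for this variant.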

Storing a table of "back-pointers" as well doesn't take much space (about as much as the table of weights, but you can actually omit most of the table of weights if you do this). Doing it that way gives you a much simpler backwards phase: just follow the pointers. That really is the path, just stored backwards.
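A sketch of the back-pointer variant, under the same kind of assumptions (not taken from the question): a time-invariant trellis given as a hypothetical dict `edges` mapping `(previous_state, next_state)` to the emitted symbol, per-transition cost equal to the Hamming distance to the received chunk, and start state 0. Only two columns of weights are alive at any time; the back-pointer table is the only full-sized one.

```python
INF = float("inf")


def hamming(a, b):
    return bin(a ^ b).count("1")


def viterbi(edges, n_states, received, start=0):
    """Return (minimal total distance, state sequence achieving it)."""
    prev_cost = [INF] * n_states
    prev_cost[start] = 0
    back = []  # back[t][s] = best predecessor of state s after chunk t, or -1 if unreachable
    for chunk in received:
        cur_cost = [INF] * n_states
        ptr = [-1] * n_states
        for (p, s), out in edges.items():
            c = prev_cost[p] + hamming(out, chunk)
            if c < cur_cost[s]:
                cur_cost[s] = c
                ptr[s] = p
        back.append(ptr)
        prev_cost = cur_cost  # the older weight column is no longer needed

    # Backwards phase: pick the cheapest final state, then just follow the pointers.
    s = min(range(n_states), key=prev_cost.__getitem__)
    best = prev_cost[s]
    path = [s]
    for ptr in reversed(back):
        s = ptr[s]
        path.append(s)
    path.reverse()
    return best, path


# Toy two-state trellis (made up for illustration).
edges = {(0, 0): 0b00, (0, 1): 0b11, (1, 0): 0b10, (1, 1): 0b01}
received = [0b11, 0b00, 0b10]  # the middle chunk has one garbled bit
best, path = viterbi(edges, 2, received)
corrected = [edges[(a, b)] for a, b in zip(path, path[1:])]
print(best, path, [format(c, "02b") for c in corrected])
```

With a small number of states, each pointer would fit in a single byte, so a `bytearray` per column (with a reserved sentinel value instead of -1) could replace the lists if memory matters.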

harold
  • It is precisely the "follow path backwards" technique that is considered prohibitively space expensive in many fields. Rather than storing the DP "frontier", it requires storing all past ones, which is space quadratic relative to what you need for just the numeric solution. There are ways around that that don't alter the asymptotic time complexity. – Ami Tavory Jul 16 '15 at 15:49
  • I added a picture for clarity. – Jonathan Ryder Jul 16 '15 at 15:59
  • @JonathanRyder so what's the actual scale of your problem? – harold Jul 16 '15 at 16:40
  • @harold the number of states (rows) is not large -- maybe <200. The number of columns though might be several hundred. – Jonathan Ryder Jul 16 '15 at 16:43
  • @JonathanRyder so that's nothing - let's say it's 200*1000, that's about 200KB if you use bytes for the backpointers – harold Jul 16 '15 at 16:46
  • @harold I am embarrassed to admit I do not fully understand how backpointers would work. Looking to my example that I posted, I assume I start from the "end" (row 0, state 7) in my table, and then I recurse back to all valid states until I hit the start (row 0, state 0). When/where would I store "pointers"? Would this be a separate cache? One for minimal distance, one for pointers? – Jonathan Ryder Jul 16 '15 at 16:49
  • @JonathanRyder yes in a separate table. Only the one needs to be full-sized then, you can use the usual "two columns" trick for the weight then since you don't need them all anymore. BTW what recursion? It's just a loop right? – harold Jul 16 '15 at 16:56
  • @harold I guess it could be done either way (DP can either be done as recursion with memo or by looping over arrays). – Jonathan Ryder Jul 16 '15 at 16:58
  • @JonathanRyder I suppose, but can you still optimize the space in that case? – harold Jul 16 '15 at 16:59
  • @harold No idea. I figure they both require the same amount of space. – Jonathan Ryder Jul 16 '15 at 17:09

You are correct that the immediate approach for calculating the paths is space-expensive.

This problem comes up often in DNA sequencing, where the cost is prohibitive. There are a number of ways to overcome it (see more here):

  • You can reduce up to a square root of the space if you are willing to double the execution time (see 2.1.1 in the link above).

  • Using a compressed tree, you can reduce one of the dimensions logarithmically (see 2.1.2 in the link above).
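The linked material is not reproduced here, so the following is not necessarily the method it describes. One common way to trade roughly double the running time for roughly square-root space is checkpointing: keep a weight column only every k steps (k about the square root of the number of columns), then recompute each block with back-pointers while walking backwards. This sketch assumes a time-invariant trellis given as a hypothetical dict `edges` mapping `(previous_state, next_state)` to the emitted symbol, Hamming-distance costs, and start state 0; all names are illustrative.

```python
import math

INF = float("inf")


def hamming(a, b):
    return bin(a ^ b).count("1")


def step(edges, n_states, cost, chunk):
    """Advance one column; return the new weight column and its back-pointers."""
    new = [INF] * n_states
    ptr = [-1] * n_states
    for (p, s), out in edges.items():
        c = cost[p] + hamming(out, chunk)
        if c < new[s]:
            new[s] = c
            ptr[s] = p
    return new, ptr


def viterbi_checkpointed(edges, n_states, received, start=0):
    T = len(received)
    k = max(1, math.isqrt(T))  # block length, about sqrt(T)

    # Pass 1: forward sweep that keeps a weight column only every k steps.
    cost = [INF] * n_states
    cost[start] = 0
    checkpoints = {0: cost}
    for t, chunk in enumerate(received):
        cost, _ = step(edges, n_states, cost, chunk)
        if (t + 1) % k == 0:
            checkpoints[t + 1] = cost

    s = min(range(n_states), key=cost.__getitem__)
    best = cost[s]
    path = [s]

    # Pass 2: from the last block to the first, recompute that block's
    # back-pointers from its checkpoint and follow them backwards.
    end = T
    while end > 0:
        begin = ((end - 1) // k) * k
        cost = checkpoints[begin]
        ptrs = []
        for chunk in received[begin:end]:
            cost, ptr = step(edges, n_states, cost, chunk)
            ptrs.append(ptr)
        for ptr in reversed(ptrs):
            s = ptr[s]
            path.append(s)
        end = begin

    path.reverse()
    return best, path
```

At most about T/k weight columns and k back-pointer columns exist at once, and every column is computed at most twice.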

Ami Tavory