Collapsing a set of strings based on a given Hamming distance

Question

Given a set of strings (first column) along with counts (second column), e.g.:

aaaa 10
aaab 5
abbb 3
cbbb 2
dbbb 1
cccc 8

Are there any algorithms or even implementations (ideally as a Unix executable, R or Python) which collapse this set into a new set based on a given Hamming distance?

Collapsing implies adding the counts.
Strings with a lower count are collapsed into strings with higher counts.

For example, for Hamming distance 1, the above set would collapse the second string aaab into aaaa, since they are at Hamming distance 1 from each other and aaaa has a higher count. The collapsed entry would have the combined count, here aaaa 15.

For this set, we would therefore get the following collapsed set:

aaaa 15
abbb 6
cccc 8

Ideally, an implementation should be efficient, so even heuristics which do not guarantee an optimal solution would be appreciated.

Further background and motivation

Calculating the Hamming distance between 2 strings (a pair) has been implemented in most programming languages. A brute-force solution would compute the distance between all pairs. Maybe there is no way around it. However, I'd imagine an efficient solution would avoid calculating the distance for all pairs. There may be clever ways to save some calculations based on metric theory (since Hamming distance is a metric), e.g. if the Hamming distance between x and z is 3, and between x and y is 3, I can avoid calculating it between y and z. Maybe there is a clever k-mer approach, or maybe some efficient solution for a constant distance (say d=1).
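
For reference, the brute-force baseline could be sketched like this in Python. It is only an illustration, assuming all strings have equal length; the names `hamming` and `close_pairs` are made up for this sketch and do not come from any library.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def close_pairs(strings, max_dist):
    """Brute force: test every unordered pair, which is quadratic in the set size."""
    return [(a, b) for a, b in combinations(strings, 2)
            if hamming(a, b) <= max_dist]

strings = ["aaaa", "aaab", "abbb", "cbbb", "dbbb", "cccc"]
print(close_pairs(strings, 1))
```

This only finds the close pairs; it does not yet decide which string absorbs which.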

Even if there was only a brute-force solution, I'd be curious whether this has been implemented before and how to use it (ideally without me having to implement it myself).

I'm well aware of how to calculate the Hamming distance between 2 strings, but the question refers to a set(!) of strings, which is not as trivial (e.g. I'd imagine an efficient solution would avoid calculating the distance for all pairs). In fact I researched this quite a bit and am not aware of a program that solves the posed example. Could you provide this in case I'm missing something? — Sebastian Müller, Nov 06 '19 at 14:48
I see no other alternative than to brute force it with a loop. There might be a way to speed it up with the clever use of a sorting algorithm, mixed with the Hamming calculation. I will retract my previous comment since the added complexity of the group of strings went over my head :-) — Nic3500, Nov 06 '19 at 14:52
No worries! I'd imagine there are probably clever ways to save some calculations based on metric theory etc (e.g. if hamming distance between x and z is 3, and x and y is 3, I can avoid calculating between y and z). Maybe there is a clever k-mer approach. Even if it was a brute force solution, I'd be curious if this has been implemented before. — Sebastian Müller, Nov 06 '19 at 15:15
Also, could whoever downvoted the question elaborate on why? This is so I can improve this and/or future questions! — Sebastian Müller, Nov 06 '19 at 15:20
That was me, I could not remove it for a while. It's gone now. — Nic3500, Nov 06 '19 at 15:31
Maybe someone on https://math.stackexchange.com/ will have an idea for an algorithm, but cross-posting is frowned upon. If you don't get an answer here in a day or two, delete here and repost there? Good luck. — shellter, Nov 06 '19 at 16:20
@SebastianMüller The additional information you wrote in comments (what you already know or thought about) should be part of the question. Comments are for clarification requests or hints for improvement. Please [edit] your question and add all relevant information to the question. — Bodo, Nov 06 '19 at 17:30
@shellter I might do this, however this is more of a computer science question, so I think it belongs here in that form. Would it be frowned upon if I rephrased it as a math question, e.g. focusing on a mathematical solution rather than a concrete implementation? — Sebastian Müller, Nov 07 '19 at 09:51
@Bodo Good point, I've edited the question accordingly and hope this is ok now. — Sebastian Müller, Nov 07 '19 at 09:53

Dan D. · Answer 1 · 2019-11-07T18:11:02.517

I thought up the following:

This reports the item with the highest score, together with the sum of its score and the scores of its nearby neighbors. Once a neighbor is used, it is not reported separately.

I suggest using a Vantage-point tree as the metric index.
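
To make "metric index" concrete, here is a minimal sketch of a vantage-point tree with a range query. It is an illustration only, assuming equal-length strings; the names `build_vp_tree` and `within` are made up for this sketch, and the tree is a plain nested tuple without any support for deletion.

```python
import random

def hamming(a, b):
    """Hamming distance; assumes both strings have the same length."""
    return sum(x != y for x, y in zip(a, b))

def build_vp_tree(points, rng=None):
    """Return a nested tuple (vantage_point, radius, inside, outside), or None."""
    rng = rng or random.Random(0)  # fixed seed only to keep the sketch reproducible
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))  # pick a random vantage point
    if not points:
        return (vp, 0, None, None)
    dists = [hamming(vp, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]  # median distance to the vantage point
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return (vp, radius, build_vp_tree(inside, rng), build_vp_tree(outside, rng))

def within(node, query, max_dist, found=None):
    """Collect all stored strings within max_dist of query (query included if stored)."""
    if found is None:
        found = []
    if node is None:
        return found
    vp, radius, inside, outside = node
    d = hamming(query, vp)
    if d <= max_dist:
        found.append(vp)
    # The triangle inequality tells us which subtrees can still contain matches.
    if d - max_dist <= radius:
        within(inside, query, max_dist, found)
    if d + max_dist > radius:
        within(outside, query, max_dist, found)
    return found

tree = build_vp_tree(["aaaa", "aaab", "abbb", "cbbb", "dbbb", "cccc"])
print(sorted(within(tree, "abbb", 1)))
```

The pruning in `within` is where the metric property pays off: whole subtrees are skipped without computing any distance to their members.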

The algorithm would look like this:

1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. for the string with the highest score in the max heap:
4. use the metric index to find the nearby strings
5. print the string, and the sum of its score and the scores of its nearby strings
6. remove from the metric index the string and each of the nearby strings
7. remove from the max heap the string and each of the nearby strings
8. repeat 3-7 until the max heap is empty

Perhaps this could be simplified by using a used table rather than removing anything. Then the metric space index would not need efficient deletion, nor would the max heap need to support deletion by value. But this would be slower if the neighborhoods are large and overlap frequently, so efficient deletion might be a necessary difficulty.

1. construct the metric index from the strings and their scores
2. construct the max heap from the strings and their scores
3. construct the used table from an empty set
4. for the string with the highest score in the max heap:
5. if this string is in the used table: start over with the next string
6. use the metric index to find the nearby strings
7. remove any nearby strings that are in the used table
8. print the string, and the sum of its score and the scores of its nearby strings
9. add the nearby strings to the used table
10. repeat 4-9 until the max heap is empty
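
The second version could be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: it assumes equal-length strings, the function name `collapse` is made up, and a plain linear scan stands in for the metric index. `heapq` is a min-heap, so counts are negated to get max-heap behavior.

```python
import heapq

def hamming(a, b):
    """Hamming distance; assumes both strings have the same length."""
    return sum(x != y for x, y in zip(a, b))

def collapse(counts, max_dist):
    """Collapse lower-count strings into higher-count strings within max_dist.

    counts: dict mapping string -> count.
    Returns a list of (leader, combined_count), highest-count leader first.
    """
    strings = list(counts)

    def nearby(s):
        # Stand-in for the metric index: a linear scan over all strings.
        return [t for t in strings if t != s and hamming(s, t) <= max_dist]

    # Negated counts turn Python's min-heap into a max heap.
    heap = [(-count, s) for s, count in counts.items()]
    heapq.heapify(heap)
    used = set()
    result = []
    while heap:
        neg_count, s = heapq.heappop(heap)
        if s in used:
            continue  # already absorbed by a higher-count string
        near = [t for t in nearby(s) if t not in used]
        result.append((s, -neg_count + sum(counts[t] for t in near)))
        used.add(s)  # harmless extra: the leader can no longer be absorbed anyway
        used.update(near)
    return result

example = {"aaaa": 10, "aaab": 5, "abbb": 3, "cbbb": 2, "dbbb": 1, "cccc": 8}
print(collapse(example, 1))
```

On the example from the question this yields aaaa 15, cccc 8 and abbb 6. With the linear scan the whole thing is still quadratic; swapping `nearby` for a real metric index is where the speed-up would come from.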

I cannot provide a complexity analysis.

I was thinking about the second algorithm. The part that I thought was slow was the checking of the neighborhood against the used table. This is not needed, as deletion from a Vantage-point tree can be done in linear time. When searching for the neighbors, remember where they were found and then remove them later using these locations. If a neighbor is used as a vantage point, mark it as removed so that a search will not return it, but leave it alone otherwise. I think this brings the cost back below quadratic, as otherwise it would be something like the number of items times the size of the neighborhood.

In response to the comment: the problem was "Strings with a lower count are collapsed into strings with higher counts.", and this does compute that. It is not a greedy approximation that could give a non-optimal result, as there was nothing to maximize or minimize. It is an exact algorithm. It returns the item with the highest score combined with the score of its neighborhood.

This can be viewed as assigning a leader to each neighborhood such that each item has at most one leader and that leader has the largest overall score so far. This can be viewed as a directed graph.

The specification wasn't for a dynamic programming or optimization problem. For that you would ask for the item with the highest score in the highest total scoring neighborhood. That can also be solved in a similar way by changing the ranking function of a string from its score to the pair of the sum of its score and its neighborhood's scores, and its score.

It does mean that this variant can't be solved with a max heap over the scores alone, as removing items changes the neighborhoods of the remaining neighbors, and one would have to recalculate their neighborhood scores before again finding the item with the highest total scoring neighborhood.
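
A brute-force sketch of that variant, ranking each remaining string by the pair (neighborhood total, own score) and recomputing the ranking after every removal. It is only an illustration, assuming equal-length strings; the name `collapse_by_neighborhood` is made up, and no attempt is made to avoid the repeated distance calculations.

```python
def hamming(a, b):
    """Hamming distance; assumes both strings have the same length."""
    return sum(x != y for x, y in zip(a, b))

def collapse_by_neighborhood(counts, max_dist):
    """Repeatedly pick the string with the highest (neighborhood total, own count)."""
    remaining = dict(counts)
    result = []
    while remaining:
        best = None
        for s, count in remaining.items():
            # Neighborhoods are recomputed each round because removals change them.
            near = [t for t in remaining
                    if t != s and hamming(s, t) <= max_dist]
            key = (count + sum(remaining[t] for t in near), count)
            if best is None or key > best[0]:
                best = (key, s, near)
        (total, _), leader, near = best
        result.append((leader, total))
        # Remove the leader and everything it absorbed.
        del remaining[leader]
        for t in near:
            del remaining[t]
    return result

example = {"aaaa": 10, "aaab": 5, "abbb": 3, "cbbb": 2, "dbbb": 1, "cccc": 8}
print(collapse_by_neighborhood(example, 1))
```

On the example from the question both rankings happen to give the same collapsed set; they can differ when a low-scoring string sits in the middle of a large neighborhood.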

This looks like a greedy algorithm, which doesn't guarantee an optimal solution, but should be good enough and efficient. I suppose you don't know of any software that has this implemented? Also, what do you mean by metric index? A data frame? — Sebastian Müller, Nov 07 '19 at 16:33
You are right, I didn't specify what to maximize. I suppose under this strategy it can't be deemed greedy (which I'd be fine with regardless, since I'm not after a guaranteed optimal solution anyway). I'll wait a few more days before accepting it, in case someone has an actual implementation, since I wanted to avoid implementing it myself. If I had to (probably in Python), could you still clarify what you mean by `metric index`? — Sebastian Müller, Nov 08 '19 at 15:22
I'm trying to implement the above in Python but struggle with point 7, since max-heaps only allow removing the max element (`heapq.heappop`), but not any other elements (e.g. the nearby strings). https://docs.python.org/3.7/library/heapq.html. Happy to share the code or discuss. — Sebastian Müller, Nov 12 '19 at 09:56
That is why I described a version that doesn't require removal: the second version, together with the note on it, uses a set to keep track of the removed items so they can be ignored when retrieved from the max heap. And as they are encountered they can be removed from the set. — Dan D., Nov 13 '19 at 19:48
