Compressing words into one word consisting of them as subwords

Question

I stumbled upon an algorithmic problem which, in short, could be stated as follows:

We have n words as input. Compress them into one new word, as short as possible, that contains every "old" word as a contiguous subword (you can derive any of the "old" words by crossing out all the letters before and after some interval in the new word).

Example:

{aabb, bbcc, c} could be compressed as {aabbcc}
{a,a,a,a,a,a} could be compressed as {a}

The only idea I have is the dumbest brute force one could think of: we take the first word and check how long its maximum overlap with every other word is, by trying to match the end of one against the start of the other, then connect the pair that gave the biggest overlap. We replace them both in our list with the newly created word and repeat until we are left with one word.

The problem with this solution is not only that it's going to be tragically slow but also that it doesn't seem to give good answers in some cases. Say we have {aabb, bbaa, b}: it would connect the first two because their overlap is 2, while the overlap with the third one is only 1. Thus, we get {aabbaa, b} => {aabbaab}, while we could have stopped at {aabbaa}, which already contains b. One way to address it would be to rank candidates by the overlap as a percentage of the word's length rather than by the raw overlap length, but I'm not sure if such tweaking of a faulty approach is a good idea…
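For reference, here is a minimal Python sketch of the greedy idea described above. It makes one change that is an assumption of this sketch, not part of the original idea: words already contained in another word are dropped first, which takes care of the {aabb, bbaa, b} case. The function names (`overlap`, `drop_contained`, `greedy_superstring`) are made up for illustration, and greedy merging is still only a heuristic, so it is not guaranteed to find the shortest word.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(words):
    """Remove duplicates and words that occur inside another word."""
    # Longest first, so every possible container is seen before its contents.
    unique = sorted(set(words), key=lambda w: (-len(w), w))
    kept = []
    for w in unique:
        if not any(w in k for k in kept):
            kept.append(w)
    return kept


def greedy_superstring(words):
    """Repeatedly merge the pair with the largest overlap (heuristic only)."""
    pool = drop_contained(words)
    while len(pool) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left word, index of right word)
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = pool[i] + pool[j][k:]
        pool = [w for idx, w in enumerate(pool) if idx not in (i, j)] + [merged]
    return pool[0] if pool else ""


print(greedy_superstring(["aabb", "bbaa", "b"]))
print(greedy_superstring(["aabb", "bbcc", "c"]))
```

The naive `overlap` check is quadratic in the word length in the worst case, so for long words a prefix-function (KMP-style) overlap computation would probably be a better fit.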

What would you suggest for such a problem to get the best results in the shortest time?

Assuming you can't (necessarily) get both the best result AND the shortest time, which are you willing to sacrifice, and how much? For example, if speed is of utmost importance, just pick an order and remove the overlaps. If result is most important, you could try ALL possible orders (though you could probably do somewhat better, speed-wise). — Scott Hunter, Mar 21 '14 at 14:09
How many words do you have? The problem can be reduced to the traveling salesman problem, so if n is not big, you can just use its optimal solution. — Natalya Ginzburg, Mar 21 '14 at 14:14
@ScottHunter - I'd go with best results over time here because it's a problem from a past algorithmic competition where you just had to send the results, not the code. I was just curious whether there's a more elegant way than brute forcing your way through. Then again, why bother with elegance when you are not limited by time... — Straightfw, Mar 21 '14 at 14:19
@NatalyaGinzburg - There were up to 500 words, each up to 500 characters long, in the competition that had this problem. — Straightfw, Mar 21 '14 at 14:20
Then your solution is not that slow - you can precompute overlap in O (n^3) and then it just takes O (n^2) steps... But it's still not optimal. Are you sure the problem has no additional constraints? Here is a similar problem, but it is easier: http://rosalind.info/problems/long/ — Natalya Ginzburg, Mar 21 '14 at 14:54
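To make the two suggestions in the comments concrete (precompute all pairwise overlaps, then treat the ordering of words as a traveling-salesman-style problem), here is a hedged Python sketch. It assumes contained words are removed first, so that the shortest result corresponds to the ordering with the largest total overlap between neighbours, and it uses a bitmask dynamic program over subsets (Held-Karp style). The name `shortest_superstring` is hypothetical. The running time grows as O(2^n · n^2), so this is only practical for small n (roughly up to 20 words) and would not scale to the 500 words mentioned above.

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring(words):
    """Exact search over word orderings; feasible only for small inputs."""
    # Drop duplicates and words contained in another word (longest first).
    ws = sorted(set(words), key=lambda w: (-len(w), w))
    ws = [w for i, w in enumerate(ws) if not any(w in ws[j] for j in range(i))]
    n = len(ws)
    if n == 0:
        return ""

    # Precompute the overlap matrix: ov[i][j] = overlap of ws[i] followed by ws[j].
    ov = [[overlap(ws[i], ws[j]) if i != j else 0 for j in range(n)] for i in range(n)]

    # dp[mask][last] = best total overlap using the words in mask, ending with last.
    full = 1 << n
    dp = [[-1] * n for _ in range(full)]
    parent = [[-1] * n for _ in range(full)]
    for i in range(n):
        dp[1 << i][i] = 0
    for mask in range(full):
        for last in range(n):
            if dp[mask][last] < 0:
                continue
            for nxt in range(n):
                if mask & (1 << nxt):
                    continue
                new_mask = mask | (1 << nxt)
                cand = dp[mask][last] + ov[last][nxt]
                if cand > dp[new_mask][nxt]:
                    dp[new_mask][nxt] = cand
                    parent[new_mask][nxt] = last

    # Walk the parent pointers back from the best final word.
    last = max(range(n), key=lambda i: dp[full - 1][i])
    mask = full - 1
    order = []
    while last != -1:
        order.append(last)
        prev = parent[mask][last]
        mask ^= 1 << last
        last = prev
    order.reverse()

    # Glue the words together, skipping the overlapping prefix each time.
    result = ws[order[0]]
    for a, b in zip(order, order[1:]):
        result += ws[b][ov[a][b]:]
    return result


print(shortest_superstring(["aabb", "bbcc", "c"]))
print(len(shortest_superstring(["aabb", "bbaa", "b"])))
```

For inputs as large as the competition's, the overlap matrix can still be precomputed, but the ordering step would presumably have to fall back to a heuristic such as greedy merging.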
