
I bet somebody has solved this before, but my searches have come up empty.

I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.

Example: doll dollhouse house

These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.

What I've come up with so far is:

  1. Sort the words longest to shortest: (dollhouse, house, doll)
  2. Scan the buffer to see if the string already exists as a substring; if so, note the location.
  3. If it doesn't already exist, add it to the end of the buffer.

Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
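
For illustration, here is a rough Python sketch of those three steps (my own throwaway code, not part of any existing tool); it reproduces the suboptimal dollhouseragdoll result:

    def pack(words):
        buffer = ""
        index = {}                                       # word -> (offset, length)
        for w in sorted(words, key=len, reverse=True):   # 1. longest first
            pos = buffer.find(w)                         # 2. already a substring?
            if pos == -1:
                pos = len(buffer)                        # 3. append to the end
                buffer += w
            index[w] = (pos, len(w))
        return buffer, index

    buf, idx = pack(["doll", "dollhouse", "house", "ragdoll"])
    print(buf)   # "dollhouseragdoll" rather than the shorter "ragdollhouse"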

This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.

As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm

Adrian McCarthy
  • Can't you just use something like gzip? – Zifre May 10 '09 at 13:21
  • What you're describing is what all compression algorithms do, except you're adding the constraint of looking at plain text words as the elements being compressed rather than bits. – Richard Nichols May 10 '09 at 13:44
  • It's not quite the same as compression algorithms, because each word must maintain its "wordiness". Like I said in another comment, you can't combine "lawman" and "woman", but in compression, it'd be fine to compress "man" together because you don't need to maintain one consistent buffer. – Dan Lew May 10 '09 at 13:46
  • Also, FWIW, the solution should be able to capitalize on multiple suffix and prefix matches. So if my wordlist had "lawman", "woman", "manage" and "mangle", it should be able to form "lawmanage" and "womangle". – Dan Lew May 10 '09 at 13:47
  • Daniel Lew is on the right track. I'm looking for packing, not compression. Maybe I'll just use a genetic algorithm to find a decent packing. – Adrian McCarthy May 10 '09 at 16:47
  • I'd just like to point out that packing is still a form of compression - there's no fundamental difference whatsoever. You just want a compression algorithm that meets certain constraints (which you choose to call "packing"), presumably so that decompression is trivial (but still decompression). – Draemon Jul 31 '09 at 16:10
  • @Draemon: The difference between compression and packing is that compressed data needs to be decompressed. Packing doesn't require decompression, just an index. – Adrian McCarthy Jul 20 '10 at 16:38
  • @Adrian: That's a false distinction. Yes, you can decompress indexed packed data in-place by accessing the index, and I agree this scheme is particularly well suited to that use, but it's still compression; there's a processing step to access the original data. Other compression can be done in-place too. – Draemon Jul 21 '10 at 06:09

8 Answers


This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.

As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.

Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
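
As a rough illustration of that greedy heuristic, here is a Python sketch of my own (a naive quadratic pairwise overlap scan rather than the radix/suffix-tree speedups mentioned above):

    def overlap(a, b):
        # Length of the longest suffix of a that is a prefix of b.
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(strings):
        # Drop strings fully contained in another string (their offsets would
        # be recorded relative to the containing string, as noted above).
        uniq = set(strings)
        strings = [s for s in uniq if not any(s != t and s in t for t in uniq)]
        # Repeatedly merge the pair with the longest overlap.
        while len(strings) > 1:
            best_k, best_i, best_j = -1, 0, 1
            for i, a in enumerate(strings):
                for j, b in enumerate(strings):
                    if i == j:
                        continue
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
            merged = strings[best_i] + strings[best_j][best_k:]
            strings = [s for n, s in enumerate(strings) if n not in (best_i, best_j)]
            strings.append(merged)
        return strings[0] if strings else ""

    print(greedy_superstring(["doll", "dollhouse", "house", "ragdoll"]))
    # -> "ragdollhouse"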

I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.

j_random_hacker
  • Thanks! Having a name for the problem is always a great start. I figured a perfect solution might be out of reach, but a good solution would be satisfying. – Adrian McCarthy May 10 '09 at 15:52

I think you can use a Radix Tree. It costs some memory because of the pointers to leaves and parents, but it makes matching strings easy: O(k), where k is the length of the longest string.
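
A true radix (Patricia) tree takes a bit more code; as a simplified stand-in, a plain prefix trie already shows the O(k) matching, and also the limitation raised in the comments below (only shared prefixes are detected). A sketch of my own:

    class TrieNode:
        def __init__(self):
            self.children = {}
            self.is_word = False

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def walks_existing_path(root, word):
        # True if every character of `word` follows existing edges, i.e.
        # `word` is a prefix of some inserted word -- O(len(word)).
        node = root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True

    root = TrieNode()
    for w in ("dollhouse", "ragdoll"):
        insert(root, w)
    print(walks_existing_path(root, "doll"))    # True: prefix of "dollhouse"
    print(walks_existing_path(root, "house"))   # False: suffix overlaps are missed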

Qubeuc
  • I believe that only works with strings that start with common substrings. Strings that end with common substrings will not be recognized. Correct me if I'm wrong. – Zifre May 10 '09 at 13:31
  • If strings end with a common substring, they wouldn't be matched up anyways based on this description. Doing so would cause the individual strings to become messed up. – Dan Lew May 10 '09 at 13:41
  • To elaborate, if you had "woman" and "lawman", you can't combine them even if you wanted to. The only way combination works (as I understand the problem) is if a suffix of one word matches a prefix of another. – Dan Lew May 10 '09 at 13:43

My first thought here is: use a data structure to determine common prefixes and suffixes of your strings, then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
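
A rough interpretation in Python (my own sketch; a real implementation would use the forward and backward radix trees suggested in the comments rather than a dictionary of every prefix):

    from collections import defaultdict

    def best_overlaps(words):
        prefix_index = defaultdict(set)          # prefix -> words starting with it
        for w in words:
            for i in range(1, len(w) + 1):
                prefix_index[w[:i]].add(w)
        best = {}
        for w in words:
            for k in range(len(w) - 1, 0, -1):   # longest suffix of w first
                partners = prefix_index.get(w[-k:], set()) - {w}
                if partners:
                    best[w] = (k, partners)
                    break
        return best

    print(best_overlaps(["ragdoll", "dollhouse", "house"]))
    # -> {'ragdoll': (4, {'dollhouse'}), 'dollhouse': (5, {'house'})}
    # Chaining ragdoll -> dollhouse (overlap 4) yields "ragdollhouse".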

Konrad Rudolph
  • What you are suggesting sounds like it could be implemented with a double radix tree (one forward and one backward). This would work in most cases, but if the strings have common parts in the middle, but not on the edges, it won't work. – Zifre May 10 '09 at 13:34
  • For example, it wouldn't recognize "consuming" and "sum". – Zifre May 10 '09 at 15:48

Looks similar to the Knapsack problem, which is NP-complete, so there is no "definitive" algorithm.

friol
  • Could you just explain to us the link with the Knapsack Problem? – akappa May 10 '09 at 15:14
  • The Knapsack problem (optimally packing some goods in a bag) looked similar to me. In fact (see j_random_hacker's answer) this is a NP-complete problem, like the Knapsack one. – friol May 10 '09 at 15:27
  • Yes, but I still can't see the similarity of that problem with the KP. 3-SAT is NPC, but I can't certainly say that it is similar to that "string packing" problem. – akappa May 10 '09 at 15:34
  • The "bag" is the string with the shortest length (the "optimally packed" one). Packing the goods into the bag is similar to adjusting the substrings in the "main" one: in both cases you have constraints (substring constraint or total weight limitation). – friol May 10 '09 at 15:42
  • IMHO the substring constraint makes the nature of the problem dramatically different, but nevermind ;) – akappa May 10 '09 at 15:52

I did a lab back in college where we were tasked with implementing a simple compression program.

What we did was sequentially apply these techniques to text:

  • BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical shortcuts for getting the letters instead of actually doing the rotations)
  • MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
  • Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols

Here, I found the assignment page.

To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
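
As a tiny, self-contained illustration of just the move-to-front step (my own sketch; BWT and Huffman are a bit longer):

    def mtf_encode(data: bytes):
        table = list(range(256))          # current symbol ordering
        out = []
        for b in data:
            i = table.index(b)            # position of the symbol
            out.append(i)
            table.pop(i)                  # move it to the front
            table.insert(0, b)
        return out

    print(mtf_encode(b"aaabbb"))          # [97, 0, 0, 98, 0, 0] -- runs become zeros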

Cᴏʀʏ
  • Interesting, but pretty much irrelevant to the question at hand. Also, it's usual to put a Run Length Encoding step in before the MTF. :) – Nick Johnson May 10 '09 at 15:58

Refine step 3.

  • Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
  • If yes, then prepend the non-overlapping prefix of the current word to that existing word, and adjust all existing references appropriately (slow!)
  • If no, add word to end of list as in current step 3.

This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
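
A rough Python sketch of that refined step (my own code; it ignores the bookkeeping for the offset/length references, which is the slow part):

    MIN_OVERLAP = 2   # ignore overlaps of length 1

    def add_word(entries, word):
        for i, existing in enumerate(entries):
            for k in range(min(len(word), len(existing)), MIN_OVERLAP - 1, -1):
                if existing.startswith(word[-k:]):
                    entries[i] = word[:-k] + existing   # prepend the distinct prefix
                    return                              # (existing references must be adjusted!)
        entries.append(word)

    entries = []
    for w in ["dollhouse", "ragdoll"]:
        add_word(entries, w)
    print(entries)   # ['ragdollhouse']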

Jonathan Leffler

I would not reinvent this wheel yet again. An enormous amount of effort has already gone into compression algorithms, so why not use one of the ones already available?

Here are a few good choices:

  • gzip for fast compression / decompression speed
  • bzip2 for a bit better compression but much slower decompression
  • LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
  • lzop for very fast compression / decompression

If you use Java, gzip is already integrated.
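
For example, a round trip through gzip is only a couple of lines in Python (java.util.zip's GZIPOutputStream/GZIPInputStream are the Java equivalent), though, as discussed in the comments below, you then have to decompress before you can read the words:

    import gzip

    data = " ".join(["doll", "dollhouse", "house", "ragdoll"]).encode()
    packed = gzip.compress(data)
    assert gzip.decompress(packed) == data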

martinus
  • I'm after packing, not compression. At run-time, I want the full text of each word readily accessible. I could do that without any sort of packing, but I recognized that packing could give me a significant reduction in footprint and improved locality of reference. – Adrian McCarthy May 10 '09 at 16:42
  • how is your packing & unpacking different from any other compression and decompression algorithm? – martinus May 11 '09 at 11:42
  • With compression, you have to decompress. With packing as I've described, there's no unpacking required. I have the full text of the original words directly available. – Adrian McCarthy May 11 '09 at 17:34

It's not clear what you want to do.

Do you want a data structure that lets you store the strings in a memory-conscious manner, while still making operations like search possible in a reasonable amount of time?

Do you just want an array of words, compressed?

In the first case, you can go for a patricia trie or a String B-Tree.

For the second case, you can just adopt some index compression technique, like this:

If you have something like:

aaa 
aaab
aasd
abaco
abad

You can compress it like this:

0aaa
3b
2sd
1baco
3d

The number is the length of the longest common prefix with the preceding string. You can tweak that scheme, e.g. by "restarting" the common prefix every K words, for faster reconstruction.
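
A small sketch of that front-coding scheme (my own Python, without the restart points):

    def common_prefix_len(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    def front_encode(sorted_words):
        out, prev = [], ""
        for w in sorted_words:
            k = common_prefix_len(prev, w)
            out.append((k, w[k:]))        # (shared prefix length, remaining suffix)
            prev = w
        return out

    def front_decode(encoded):
        words, prev = [], ""
        for k, suffix in encoded:
            prev = prev[:k] + suffix
            words.append(prev)
        return words

    enc = front_encode(["aaa", "aaab", "aasd", "abaco", "abad"])
    print(enc)                 # [(0, 'aaa'), (3, 'b'), (2, 'sd'), (1, 'baco'), (3, 'd')]
    print(front_decode(enc))   # ['aaa', 'aaab', 'aasd', 'abaco', 'abad']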

akappa
  • Note that, with the last scheme, you should get much better compression than the packing you've suggested. Of course you can't just keep one pointer to each word, but rather a tuple (pointer to the first word with a 0 prefix, offset). – akappa May 10 '09 at 15:36
  • I'm not looking for a compression method. I need fast random-access to the full text of each word, so I don't want to decompress on the fly. Packing reduces the memory footprint and improves locality of reference. – Adrian McCarthy May 10 '09 at 16:44
  • Are you sure that it improves locality? Locality depends largely upon the order in which you request words, not only the memory footprint (except edge cases, of course). And are you really sure that it greatly improves the memory footprint? It seems to me that this optimization can be a good thing if you have a particular set of strings, but it's practically useless on, e.g., natural-language words. – akappa May 10 '09 at 18:12