I have a list of millions of street names and want to compress them with a compression algorithm, but I am not sure which one would fit best. Most street names share common substrings, such as "street", "way", ...

The set of all street names is fixed and won't change dynamically.

At first I was thinking of Huffman coding, but since that only codes single letters it won't give great performance. So I thought of building a trie and counting the most common substrings. Then I could have some sort of code to traverse this trie in order to get a word back, and compress these codes using something like Huffman coding. I'm not sure whether this is making it more complicated than it needs to be.
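
For illustration, the counting part alone could look like the following C# sketch. The candidate substrings and sample names here are made up; a real version would discover the candidates via the trie rather than hard-coding them.

    using System;
    using System.Linq;

    class SuffixCount
    {
        static void Main()
        {
            // Made-up sample data; the real input would be the full name list.
            string[] names = { "Broadway", "Main Street", "High Street", "Kingsway" };
            string[] candidates = { "street", "way" }; // hand-picked for this sketch

            foreach (string c in candidates)
            {
                // Tally how many names end in the candidate substring.
                int count = names.Count(n => n.EndsWith(c, StringComparison.OrdinalIgnoreCase));
                Console.WriteLine(c + ": " + count);
            }
        }
    }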

Does anyone know a compression technique that makes sense in my case?

EDIT 1

My use case is this: I have a phone with limited storage that needs to hold the names of all streets in a particular country. Every street object has some values, and amongst them the name of the street as a string. That string takes up most of the space, and I would like to minimize it. Since the names are quite similar, i.e. most ending in "...street" or "...way", I thought it might be worth implementing a compression algorithm geared specifically towards this scenario.

A simple gzip compresses the data to about 50% of its original size. I think it should be possible to get more out of it.
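
That figure can be measured with a few lines of C# (the sample names are made up and far too few to compress well, but the mechanics are the same on the real list):

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Text;

    class GzipBaseline
    {
        static void Main()
        {
            string[] names = { "Broadwater Access", "Broadwater Bluff", "Broadway" };
            byte[] raw = Encoding.UTF8.GetBytes(string.Join("\n", names));

            using (var output = new MemoryStream())
            {
                // Compress the whole blob in one pass and compare sizes.
                using (var gzip = new GZipStream(output, CompressionLevel.Optimal))
                {
                    gzip.Write(raw, 0, raw.Length);
                }
                Console.WriteLine("raw: " + raw.Length + " bytes, gzipped: " + output.ToArray().Length + " bytes");
            }
        }
    }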

EDIT 2

Ebbe M. Pedersen's solution actually gives very good results. Here is some code (written in C#):

    private IndexedItem[] _items;

    public void CompressStrings(string[] strings)
    {
        // Sort so that consecutive entries share the longest possible prefixes.
        Array.Sort(strings);
        _items = new IndexedItem[strings.Length];

        string lastString = string.Empty;

        for (int i = 0; i < strings.Length; i++)
        {
            // Count the leading characters shared with the previous name,
            // capped at 255 so the count still fits in a byte.
            byte j = 0;
            while (j < lastString.Length && j < strings[i].Length
                && j < byte.MaxValue && lastString[j] == strings[i][j])
            {
                j++;
            }

            // Store only the shared-prefix length and the differing suffix.
            _items[i] = new IndexedItem() { Prefix = j, Suffix = strings[i].Substring(j) };

            lastString = strings[i];
        }
    }

    private struct IndexedItem
    {
        public byte Prefix;   // characters to copy from the previous entry
        public string Suffix; // the rest of this street name
    }

After this transform I also send the data through a DeflateStream, which brings the total down to about 30% of the original size.
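
The Deflate step looks roughly like this, as a method on the same class as above. The binary layout (one prefix byte followed by BinaryWriter's length-prefixed suffix string) is my own choice, not part of Ebbe's suggestion; it needs `using System.IO;`, `using System.IO.Compression;` and `using System.Text;`.

    public void SaveCompressed(string path)
    {
        using (var buffer = new MemoryStream())
        {
            using (var writer = new BinaryWriter(buffer, Encoding.UTF8, true))
            {
                foreach (var item in _items)
                {
                    writer.Write(item.Prefix);  // shared-prefix length
                    writer.Write(item.Suffix);  // length-prefixed UTF-8 suffix
                }
            }

            // Deflate the serialized buffer to disk.
            buffer.Position = 0;
            using (var file = File.Create(path))
            using (var deflate = new DeflateStream(file, CompressionLevel.Optimal))
            {
                buffer.CopyTo(deflate);
            }
        }
    }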

Thanks very much for the answers

Christian
  • How are you going to use your compressed data? – Serg Mar 26 '13 at 19:32
  • 1
    Do you need to implement it yourself? I'd just test some compression libraries and use whatever works best. I imagine [LZMA](http://en.wikipedia.org/wiki/Lempel-Ziv-Markov_chain_algorithm) would be good. – Blorgbeard Mar 26 '13 at 19:33
  • Not sure exactly what your use case is, but would gzip/bzip be sufficient? – mon4goos Mar 26 '13 at 19:32
  • I added my use case; I hope this clarifies it. Basically I will query the street (which has some id) and ask it for its street name. – Christian Mar 26 '13 at 21:29
  • Thanks for adding the use-case discussion. I'm still a bit confused about how you intend to use this exactly. I see that the compression needs to be in place during storage and not just during transport? Will you be looking up street names in a table? – Multimedia Mike Mar 27 '13 at 02:33
  • Basically, on the phone I will save millions of streets. Each street has several properties, like: name, coordinates, driving speed limit, ... Amongst these properties, the name takes up the most space. I will query for a specific street (by id) and want to display its name. The streets will be loaded dynamically from storage as soon as they are needed. I want to compress them as much as possible to minimize the size of the app and its data. – Christian Mar 27 '13 at 08:22
  • Nice to see my suggestion in action :) – Ebbe M. Pedersen Mar 28 '13 at 19:10

3 Answers

Depending on your data set, you could start by sorting your street names and then representing every street name as the length of the prefix it shares with the previous street name plus the 'different part'.

An example with some similar street names:

      How much to copy from previous street name in Hex 
                         | The rest of the street name
Original                 V   V V V            Orig size  New size
Broadwalk                0 Broadwalk             9         10
Broadwater               7 ter                   8          4
Broadwater Access        A  Access              17          8
Broadwater Bluff         B Bluff                16          6
Broadwater Branch        C ranch                17          6
Broadwater Bridge        D idge                 17          5
Broadwater Cemetary      B Cemetary             19          9
Broadwater Creek         C reek                 16          5
Broadwater Point         B Point                16          6
Broadwater Pvt           C vt                   14          3
Broadwaters              A s                    11          2
Broadway                 7 y                     8          2
Broadway And Union       8  And Union           18         11
Broadway Apartments      9 partments            19         10
Broadway Avenue          9 venue                15          6
                                               ---        ---
                                               220         93

You would need to process a range of names to get back to the real one, but if you make a convention of fully spelling out every n-th record, you can tune that trade-off to your needs.
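
Decoding then walks back to the nearest fully spelled-out record (one whose copy count is 0, which also covers the very first entry) and replays the entries from there. A rough C# sketch, with the pair representation chosen just for this example:

    using System;

    static class FrontCoding
    {
        // items[i] = (copyCount, rest); copyCount == 0 marks a complete name.
        public static string Decode(Tuple<int, string>[] items, int index)
        {
            int start = index;
            while (items[start].Item1 != 0)   // walk back to a restart point
                start--;

            string name = items[start].Item2;
            for (int i = start + 1; i <= index; i++)
                name = name.Substring(0, items[i].Item1) + items[i].Item2;

            return name;
        }
    }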

Combine this with using only 5–6 bits per letter, and maybe some common-substring replacements, and you ought to be able to beat the 50% you see with gzip.
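
The 5-bit packing could be sketched like this, assuming a 32-symbol alphabet of space plus A–Z (real data would also need digits, hyphens and the like, which is where the sixth bit comes in):

    using System;
    using System.Collections.Generic;

    static class FiveBitPacker
    {
        const string Alphabet = " ABCDEFGHIJKLMNOPQRSTUVWXYZ"; // 27 of 32 possible symbols

        public static byte[] Pack(string text)
        {
            var bits = new List<bool>();
            foreach (char c in text.ToUpperInvariant())
            {
                int code = Alphabet.IndexOf(c);
                if (code < 0)
                    throw new ArgumentException("symbol outside the 5-bit alphabet: " + c);
                for (int b = 4; b >= 0; b--)          // emit 5 bits, MSB first
                    bits.Add(((code >> b) & 1) == 1);
            }

            var packed = new byte[(bits.Count + 7) / 8];
            for (int i = 0; i < bits.Count; i++)
                if (bits[i])
                    packed[i / 8] |= (byte)(1 << (7 - i % 8));
            return packed;
        }
    }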

Ebbe M. Pedersen
  • This is actually a very good idea. So far I have been searching for the longest common substrings across all names. The running time is pretty high, but I have about 1000 computers to do this in parallel, so it is possible. Using that algorithm I found patterns like "street", "way", and many more. On its own it gives a compression rate of about 50%, but combined with your idea it could be really interesting! – Christian Mar 28 '13 at 14:26

Using an algorithm with static dictionary coding would be better. You can have a try with my toy compression util: http://code.google.com/p/comprox (the comprop component).

But the best way is to apply a lossless transform to your data before passing it to a general-purpose compression program, since you have a better understanding of your data than the compressor does.
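
As a toy example of such a transform (the word list and token characters are made up; a real dictionary would come from frequency counts, and the tokens must never occur in the raw names):

    using System.Collections.Generic;

    static class DictionaryTransform
    {
        // Maps frequent substrings to single characters that are absent
        // from the input, so the transform stays losslessly reversible.
        static readonly Dictionary<string, string> Subs = new Dictionary<string, string>
        {
            { "street", "\u0001" },
            { "way",    "\u0002" },
            { "avenue", "\u0003" },
        };

        public static string Encode(string name)
        {
            foreach (var s in Subs)
                name = name.Replace(s.Key, s.Value);
            return name;
        }

        public static string Decode(string name)
        {
            foreach (var s in Subs)
                name = name.Replace(s.Value, s.Key);
            return name;
        }
    }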

richselian

Do not use Huffman; LZ algorithms are best suited for this.

I'd suggest you combine all the street names into a single text file (only the street names), with each name NUL-terminated so that the individual strings can be pulled back out, and compress this file. You will nevertheless have to figure out how to manage it in the, perhaps, limited memory of the mobile device.
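
In C# that layout could be produced like this (the file name and the `streetNames` parameter are placeholders for your own data):

    using System.IO;
    using System.IO.Compression;
    using System.Text;

    static class BlobWriter
    {
        public static void Write(string path, string[] streetNames)
        {
            using (var file = File.Create(path))
            using (var deflate = new DeflateStream(file, CompressionLevel.Optimal))
            using (var writer = new StreamWriter(deflate, Encoding.UTF8))
            {
                foreach (string name in streetNames)
                {
                    writer.Write(name);
                    writer.Write('\0');   // terminator used to split names on read
                }
            }
        }
    }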

Also, take a look at SMAZ

Ujjwal Singh
  • hm, SMAZ is geared towards English text, compressing words like "the" into a single byte. For my particular case it won't give such good compression, especially because I need to compress the single names separately, not one big text. – Christian Mar 28 '13 at 14:24