4

I need to store a trillion lists of URLs, where each list will contain ~50 URLs. What would be the most space-efficient way to compress them for on-disk storage?

I was thinking of first removing useless information like "http://" and then building a minimal finite state automaton and saving that.

Another option is to build a comma-separated string of URLs and compress this string using a general-purpose compressor such as GZIP or BZ2.

If I don't care about speed, which solution would result in the best compression?
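
A minimal Python sketch of the second option (the sample URLs are made up, and a real list would hold ~50 of them): strip the scheme, join the list with commas, and compare the GZIP and BZ2 sizes.

    import gzip, bz2

    urls = [
        "http://www.example.com/about-us",
        "http://www.example.com/careers",
        "http://blog.example.org/2014/03/hello-world",
    ]  # in practice each list would hold ~50 URLs

    # drop the redundant scheme, then compress the comma-joined string
    stripped = [u.replace("http://", "", 1) for u in urls]
    payload = ",".join(stripped).encode("utf-8")

    print("raw :", len(payload))
    print("gzip:", len(gzip.compress(payload, 9)))
    print("bz2 :", len(bz2.compress(payload, 9)))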

skyde
  • What operations do you need to perform on the list? That will probably inform the choice of data structure. – templatetypedef Mar 07 '14 at 19:51
  • I just need to be able to iterate over all the URLs in the list. I don't care about how much space it takes in memory or how much time it takes to compress/decompress it. It just needs to be very small once stored on disk. – skyde Mar 07 '14 at 19:59
  • To confirm - you want a small on-disk representation that you can then inflate into a larger in-memory structure if necessary? Also, if you only have 50 URLs, are you sure this sort of compression is even necessary? – templatetypedef Mar 07 '14 at 20:01
  • Yes, because I will have to store a trillion of those lists. – skyde Mar 07 '14 at 20:10
  • No, I mean I need to store a large number of lists, where each list is about 50 URLs long. – skyde Mar 07 '14 at 20:13
  • I think the best option here is to try a bunch of options and profile them to see which ones give the best space usage. A standard compression algorithm will probably do a great job here, though a more advanced structure like a minimum-state DFA (called a DAWG in this context, by the way) run through a compressor might be better. – templatetypedef Mar 07 '14 at 20:15
  • Do you realize those lists would take up half a petabyte even if you achieve an unlikely compression ratio of 10:1? Just to make sure, since you probably do realize that. – Niklas B. Mar 07 '14 at 20:23
  • I would imagine a compressed bitwise Trie would be useful. – Nuclearman Mar 07 '14 at 20:58
  • As templatetypedef suggested, it seems that a Compact Directed Acyclic Word Graph would work best. – skyde Mar 07 '14 at 22:50
  • Somehow, I think your 50 trillion URLs is the proposed solution to a larger problem. 50 trillion is an astonishingly large number of URLs. Are they all unique? Is it possible that there's a better solution to your larger problem that wouldn't require you to store hundreds of terabytes of data? – Jim Mischel Mar 08 '14 at 00:02
  • Some URLs are in several lists, so they are not globally unique, but they are unique inside each list. One option would be to store a global dictionary mapping every unique URL to an id, and then store each list of URLs as a list of ids. But that makes the problem even harder: how do you store and update the huge dictionary of unique URLs? It also makes decompressing each small list much harder. – skyde Mar 08 '14 at 02:00
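
Purely as an illustration of the global-dictionary idea in the comment above (the names and the fixed 4-byte id width are assumptions, and this ignores the hard part of persisting and updating the dictionary), a Python sketch:

    import struct

    url_to_id = {}   # global dictionary: URL -> integer id
    id_to_url = []   # reverse table used when decompressing

    def intern_url(url):
        # assign a new id the first time a URL is seen
        if url not in url_to_id:
            url_to_id[url] = len(id_to_url)
            id_to_url.append(url)
        return url_to_id[url]

    def pack_list(urls):
        # one list becomes a run of 4-byte little-endian ids
        return b"".join(struct.pack("<I", intern_url(u)) for u in urls)

    def unpack_list(blob):
        ids = struct.unpack("<%dI" % (len(blob) // 4), blob)
        return [id_to_url[i] for i in ids]

    packed = pack_list(["http://a.example/x", "http://b.example/y"])
    print(unpack_list(packed))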

2 Answers

1

Given the number of URLs and the fact that most of them use more or less the same structures and naming patterns, I would go with an index and a tokenizer. First use a tokenizer to gather as many words as possible and save them in an index. You can then replace each token by its index in the list:

http://www.google.com/search?q=hello+world (42 bytes) would give you:

http://    => 1
www.       => 2
google.com => 3
search     => 4
hello      => 5
world      => 6

and the URL will become: 1,2,3,'/',4,'?','q','=', 5,'+',6

Given that a lot of URLs will be subdomains of a common big domain and that most of them will use the same common English words (think of all the "about us" or "careers" pages...), you will probably end up with a fairly small index (there are about 50,000 common words in English, 70,000 in French).

You can then compress the index and the tokenized URLs to gain even more space.

There are O(n) and O(n log n) algorithms for parsing the URLs and building the index.
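
A rough Python sketch of this tokenize-and-index approach; the tokenizer, the id assignment, and the token boundaries here are illustrative assumptions rather than the exact scheme above:

    import re

    token_to_id = {}   # global token dictionary, shared by all URLs
    id_to_token = []

    def encode(url):
        # split into runs of letters/digits (word tokens) and single
        # separator characters; tokens get ids, separators stay literal
        out = []
        for part in re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url):
            if re.fullmatch(r"[A-Za-z0-9]+", part):
                if part not in token_to_id:
                    token_to_id[part] = len(id_to_token)
                    id_to_token.append(part)
                out.append(token_to_id[part])
            else:
                out.append(part)
        return out

    print(encode("http://www.google.com/search?q=hello+world"))
    # -> [0, ':', '/', '/', 1, '.', 2, '.', 3, '/', 4, '?', 5, '=', 6, '+', 7]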

Samy Arous
  • That's not particularly good compression. Assuming that the indexes are 32-bit integers, your compressed URL takes 24 bytes for the indexes and 4 more bytes for the individual characters, so the compression ratio is 3:2. Not very good at all. You could do better than that with a fixed Huffman encoder. – Jim Mischel Mar 07 '14 at 23:55
  • If you take into account one single URL, yes, but consider the fact that each integer will be present hundreds or thousands of times (think of the word "and" or "hello"). The more URLs you have, the better the compression ratio. This is actually a very efficient compression algorithm, but it can be refined of course. I'm just giving the general idea. – Samy Arous Mar 08 '14 at 00:00
  • I like the idea of having a global dictionary to store hosts, e.g. "http://www.google.com/", which lets me replace it with a 4-byte id, so each entry becomes (4-byte host id) + (20-byte string "search?q=hello+world"). Then the list could be further processed with GZIP to compress the paths. – skyde Mar 08 '14 at 01:26
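
A small Python sketch of that host-dictionary variant (names and the on-disk layout are assumptions, not a tested format): keep one global host-to-id table, store each entry as a 4-byte host id plus its path and query, and GZIP the concatenated paths of a list.

    import gzip, struct
    from urllib.parse import urlsplit

    host_to_id = {}   # global dictionary of hosts

    def host_id(host):
        # hand out ids in order of first appearance
        return host_to_id.setdefault(host, len(host_to_id))

    def compress_list(urls):
        # store the host ids raw and gzip the newline-joined paths+queries
        ids, paths = [], []
        for u in urls:
            parts = urlsplit(u)
            ids.append(host_id(parts.netloc))
            paths.append(parts.path + ("?" + parts.query if parts.query else ""))
        id_blob = b"".join(struct.pack("<I", i) for i in ids)
        return id_blob + gzip.compress("\n".join(paths).encode("utf-8"))

    blob = compress_list(["http://www.google.com/search?q=hello+world",
                          "http://www.google.com/about"])
    print(len(blob), "bytes for 2 URLs")
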
0

After investigating, it seems that just using GZIP compresses better than using a Compact Directed Acyclic Word Graph!

skyde