Need a highly compressed datastore for Crawl data and log data

Question

I have to store a lot of crawl and log data in a Datastore with an efficient compression ratio.

So far I tried and installed Cassandra, Couchbase, Mysql and an FlatFile format and read the architectual overview of Google Big Table, Hypertable and the LevelDB File Layout.

Cassandra and Couchbase are about 1/5 in disk size of the uncompressed Mysql Database, but I want better results.

So I need a Simple Data Store with high compression features as in vertica, teradata, oracle and sqlserver products. (Page level compression)

The actual flatFile dataSet looks like

/oil_type/gas_station/2014-03/2014-03-05-23.csv
/oil_type/gas_station/2014-03/2014-03-06-00.csv
/oil_type/gas_station/2014-03/2014-03-06-01.csv

Per File are about 400 high redundant entries each about 5kb A File can be compressed from 1722 KB to 39 KB so an compression ratio of 44:1 up to 100:1 depending on the compression chunk size should be possible.

Defining the use case:

I have to poll all relevant gas_station webpages/apis every 30 seconds to get up to the minute pricing information, because it is not possible to write a parser for every gas station, a generic solution is required for index creation. With a database holding all craweld gas station pages a generic parser can easily be developed and backtestet. With this raw data model data loss through broken specific converters should be avoided.

With keys like "oil_type-gas_station-timestamp-content", its easy and efficient to compare two gas_station pricings over time. For reading a Time Serie that is smaller then the compression chunk size only 2 to 4 chunks should be decompressed.

So the following features are optimal:

SSTables
Configurable Compression Options (Level,Compression Engine,Chunk Size (from 64kb to 10 MB))
Range Scans
Java Bindings
column datasore for better compression

Nice to have:

Replication
Multi Master
write quorum of 1
Forward and backward iteration over the data. (to compare two time series)
configurable replica distribution
few dependencies

Question:

Wich free Database is able to hold archived data of high redundant crawl data (only a few bytes change) , compresses good and does not use too much time to query a random record. (In opposit to the mysql archive format, that has to decompress the whole table until the requested row)

Maybe there is a log database, that is able to index a lot of log lines and compresses them internaly? (scope of logstash, fluentd, flume)

If someone would know some benchmarks, numbers on this topic it would help a lot, to evaluate the right technology.

I am glad for your help!

hypertable will work great - it's replicated, distributed and compressed (prefix compression for keys, snappy for files on disk.) If replication is not your primary concern then you can also try hamsterdb because it is a column datastore and very efficient i.e. when storing fixed length keys. heavy-weight compression (zlib, snappy...) for records and for large keys is only available in a commercial version. — cruppstahl, Jun 20 '14 at 09:45

Stefan Steiger · Answer 1 · 2014-08-27T12:10:40.900

Assuming you are in a multithreaded environment, possible multi-process, LevelDB is NOT a good idea.

Cassandra is written in Java, therefore you'll see excessive memory consumption when handling a big load of big files, at least without tweaking the JVM. Also, since it's written in Java, it probably won't be fast enough for really good compression.

I use HyperTable on my Linux-box to store photos and movies.
You can use HyperTable from any language with Thrift-support.

Additionally, if you need it, you can use the C++ drivers, for extra-speed.
One thing that is nice about HyperTable is that it doesn't add a dependency on Java, since it's written in C++, which also means it's blazing fast and not garbage-collected (no memory overhead).

Hypertable does have a Java client however, out of the box.
I use my own C# Thrift-client, which I ported from Java.
See >here< for the code.

Since HyperTable operates on Byte-Arrays, you can simply put your file in the thrift-client as byte-array, and HyperTable will compress it for you automagically, if you have told it to do so in the column definition.

You could also try MongoDb, if you absolutely want to.
Mongo actually derives from humongous, by the way.
However, I must say I have never "really" used it.

Need a highly compressed datastore for Crawl data and log data

1 Answers1