I have to store a lot of crawl and log data in a datastore with a good compression ratio.
So far I have tried and installed Cassandra, Couchbase, MySQL and a flat-file format, and I have read the architectural overviews of Google BigTable, Hypertable and the LevelDB file layout.
Cassandra and Couchbase take about 1/5 of the disk size of the uncompressed MySQL database, but I want better results.
So I need a simple data store with strong compression features, like the page-level compression in the Vertica, Teradata, Oracle and SQL Server products.
The actual flat-file dataset looks like this:
/oil_type/gas_station/2014-03/2014-03-05-23.csv
/oil_type/gas_station/2014-03/2014-03-06-00.csv
/oil_type/gas_station/2014-03/2014-03-06-01.csv
Per file there are about 400 highly redundant entries of about 5 KB each. A file can be compressed from 1722 KB to 39 KB, so a compression ratio of 44:1, up to 100:1 depending on the compression chunk size, should be possible.
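To sanity-check what a given chunk size buys, one can deflate a whole block of rows in one piece and measure the result. Here is a minimal sketch using only java.util.zip; the row content is invented for illustration:

```java
import java.util.zip.Deflater;

public class ChunkRatio {
    public static void main(String[] args) {
        // Build one chunk of highly redundant entries; the row content
        // here is invented purely for illustration.
        StringBuilder chunk = new StringBuilder();
        for (int i = 0; i < 400; i++) {
            chunk.append("diesel;station_4711;2014-03-05T23:00:30;1.359;<html>page snapshot</html>\n");
        }
        byte[] raw = chunk.toString().getBytes();

        // Compress the whole chunk in one go, as a block compressor would.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length];
        int compressedSize = deflater.deflate(buf);
        deflater.end();

        System.out.printf("raw=%d bytes, compressed=%d bytes, ratio=%.0f:1%n",
                raw.length, compressedSize, (double) raw.length / compressedSize);
    }
}
```

The larger the chunk that is compressed as one unit, the more cross-entry redundancy the compressor can exploit, which is why the achievable ratio depends on the chunk size.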
Defining the use case:
I have to poll all relevant gas_station web pages/APIs every 30 seconds to get up-to-the-minute pricing information. Because it is not possible to write a parser for every gas station, a generic solution is required for index creation. With a database holding all crawled gas station pages, a generic parser can easily be developed and backtested, and with this raw data model, data loss through broken station-specific converters should be avoided.
With keys like "oil_type-gas_station-timestamp-content", it is easy and efficient to compare two gas stations' pricing over time. For reading a time series that is smaller than the compression chunk size, only 2 to 4 chunks should have to be decompressed.
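As a concrete example of this key layout, here is a minimal sketch of writing one entry and range-scanning a series, assuming the pure-Java LevelDB port (org.iq80.leveldb); the key, path and page content are invented for illustration:

```java
import org.iq80.leveldb.CompressionType;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.Options;
import static org.iq80.leveldb.impl.Iq80DBFactory.*;

import java.io.File;
import java.util.Map;

public class PriceScan {
    public static void main(String[] args) throws Exception {
        Options options = new Options()
                .createIfMissing(true)
                .compressionType(CompressionType.SNAPPY); // blocks are compressed individually
        DB db = factory.open(new File("prices-db"), options);

        // Key layout: oil_type-gas_station-timestamp, value: raw crawled page.
        db.put(bytes("diesel-station_4711-2014-03-05T23:00:30"),
               bytes("<html>crawled page snapshot</html>"));

        // Range scan: seek to the series prefix and iterate forward;
        // only the blocks actually touched need to be decompressed.
        String prefix = "diesel-station_4711-";
        try (DBIterator it = db.iterator()) {
            for (it.seek(bytes(prefix)); it.hasNext(); ) {
                Map.Entry<byte[], byte[]> entry = it.next();
                String key = asString(entry.getKey());
                if (!key.startsWith(prefix)) break; // past the end of this series
                // compare entry.getValue() against the second series here
            }
        }
        db.close();
    }
}
```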
So the following features would be optimal:
- SSTables
- Configurable compression options (level, compression engine, chunk size from 64 KB to 10 MB; see the sketch after this list)
- Range scans
- Java bindings
- Column data store for better compression
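On the chunk-size point referenced above: in LevelDB the compression unit is the block, so the ratio/latency trade-off is tuned via Options. A hedged sketch, again assuming org.iq80.leveldb, with the 1 MB block size picked arbitrarily:

```java
import org.iq80.leveldb.CompressionType;
import org.iq80.leveldb.Options;

public class TunedOptions {
    static Options tuned() {
        return new Options()
                .createIfMissing(true)
                // Larger blocks expose more cross-entry redundancy to the
                // compressor, at the cost of decompressing more data per
                // random read. The default is 4 KB; 1 MB is an assumption.
                .blockSize(1024 * 1024)
                // Snappy is the only engine LevelDB ships with; pluggable
                // engines and levels would need a different store.
                .compressionType(CompressionType.SNAPPY);
    }
}
```

For comparison, Cassandra exposes similar per-table knobs via sstable_compression and chunk_length_kb.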
Nice to have:
- Replication
- Multi-master
- Write quorum of 1
- Forward and backward iteration over the data, to compare two time series (see the reverse-scan sketch after this list)
- Configurable replica distribution
- Few dependencies
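For the backward iteration point above, LevelDB's Java iterator can also step in reverse. A sketch continuing the same assumptions (org.iq80.leveldb, invented key scheme):

```java
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import static org.iq80.leveldb.impl.Iq80DBFactory.*;

import java.util.Map;

public class ReverseScan {
    // Walk one series newest-to-oldest; prefix and fromKey follow the
    // invented "oil_type-gas_station-timestamp" scheme from above.
    static void scanBackwards(DB db, String prefix, String fromKey) throws Exception {
        try (DBIterator it = db.iterator()) {
            it.seek(bytes(fromKey)); // position at/after the newest wanted entry
            while (it.hasPrev()) {
                Map.Entry<byte[], byte[]> entry = it.prev();
                String key = asString(entry.getKey());
                if (!key.startsWith(prefix)) break; // left the series
                // newest-to-oldest comparison of two series goes here
            }
        }
    }
}
```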
Question:
Which free database is able to hold archived, highly redundant crawl data (only a few bytes change between records), compresses well, and does not take too much time to query a random record? (In contrast to the MySQL ARCHIVE format, which has to decompress the whole table up to the requested row.)
Maybe there is a log database that is able to index a lot of log lines and compresses them internally? (In the scope of Logstash, Fluentd, Flume.)
If someone knows some benchmarks or numbers on this topic, it would help a lot to evaluate the right technology.
I would be glad for your help!