I have a Hive table partitioned by Year/Month containing at least 7 years of data. What I want to do is compress the latest data (up to a year old) with Snappy, but compress the older data with a better-ratio technique such as gzip. How can I do this in Hive?
- Are you sure you want to compress your data using gzip? That format is not splittable, so you will run into serious performance issues when querying data that is split across blocks. – dbustosp Jun 11 '18 at 02:49
- I mean, I just gave an example. I would still like to compress rarely used old data into some other format with a better compression ratio (even at the cost of performance, since I would be using that data very infrequently). – Bheem Singh Jun 11 '18 at 04:57
1 Answer
You can overwrite different partitions with different compression settings: configure the compression codec, then INSERT OVERWRITE the partitions you want recompressed with that codec.
For Snappy (note that output compression must be enabled for the codec setting to take effect):
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
Overwrite the partitions using the Snappy codec:
--enable dynamic partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--use where conditions to limit new data only
insert overwrite table table_name partition(partition_key)
select * from table_name where partition_key <your condition here> ;
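As a concrete sketch of the Snappy pass: the table name `logs`, the partition columns `year`/`month`, and the June 2017 cutoff are assumptions based on the question's Year/Month partitioning, so adjust them to your schema:

```sql
-- Enable output compression and select the Snappy codec
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- Allow dynamic partition overwrite
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Rewrite only the recent partitions (SELECT * on a partitioned table
-- returns the partition columns last, as dynamic partitioning expects)
INSERT OVERWRITE TABLE logs PARTITION (year, month)
SELECT * FROM logs
WHERE year * 100 + month >= 201706;
```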
For gzip, use GzipCodec (again with output compression enabled):
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
Overwrite the old partitions using GzipCodec:
--enable dynamic partitions
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
--use where conditions to limit OLD data only
insert overwrite table table_name partition(partition_key)
select * from table_name where partition_key <your condition here> ;
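A matching sketch for the gzip pass over the older partitions (as before, the table name `logs`, partition columns `year`/`month`, and the cutoff date are illustrative assumptions):

```sql
-- Enable output compression and select the gzip codec
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Allow dynamic partition overwrite
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Everything older than the cutoff gets the higher-ratio codec
INSERT OVERWRITE TABLE logs PARTITION (year, month)
SELECT * FROM logs
WHERE year * 100 + month < 201706;
```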
This way you can use different compression codecs for different partitions, and you do not need to specify the codec when selecting from the table: Hive automatically recognizes which codec to use for decompression.
Of course, this does not apply to self-contained file formats like ORC or Parquet, which have their own compression properties, for example orc.compress=SNAPPY.
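For an ORC table the compression setting lives in table properties rather than in MapReduce settings. A minimal sketch (the table name `logs_orc` and its columns are assumptions for illustration):

```sql
-- For ORC, compression is declared as a table property
CREATE TABLE logs_orc (id BIGINT, msg STRING)
PARTITIONED BY (year INT, month INT)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');  -- or 'ZLIB' for a better ratio

-- An existing ORC table can be switched for future writes;
-- files already written keep their old codec until overwritten
ALTER TABLE logs_orc SET TBLPROPERTIES ('orc.compress' = 'ZLIB');
```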

leftjoin
- Bzip2 would be a better choice than Gzip for any potential querying. – OneCricketeer Jun 11 '18 at 13:32