I am executing a Hive INSERT OVERWRITE query:
SET hive.merge.mapredfiles=false;
SET mapred.reduce.tasks=${hiveconf:NUM_REDUCERS};
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
SET hive.execution.engine=tez;
SET hive.tez.container.size=6144;
INSERT OVERWRITE TABLE CS.STAGING_AVRO PARTITION (CPD_DT='${hiveconf:CPD_DT}',CUT_NO='${hiveconf:CUT_NO}')
SELECT DISTINCT CS_TRAN.field1, CS_TRAN.field2, ...
FROM ( SELECT A, B, C, ... FROM NMS1.TBL1 WHERE ... ) CS_TRAN
JOIN NMS2.TBL2 WHERE ...
This INSERT OVERWRITE query joins data from two tables and writes the result into the CS.STAGING_AVRO table, partitioned by date and cut number. The issue I am facing is that it creates one big file per partition.
For example:
- /data/cpd_dt=20230220/cut_no=00/000000_0 --> 3 GB
- /data/cpd_dt=20230220/cut_no=01/000000_0 --> 3 GB
- /data/cpd_dt=20230220/cut_no=02/000000_0 --> 3 GB
Every partition gets a single 3 GB file. I want to split that one large file into multiple files of about 500 MB each, but I have not been able to achieve this.
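For scale: 3 GB at roughly 500 MB per file means about 6 files per partition, so my understanding is that ~6 reducers would each need to write into every partition. One approach I have seen suggested is spreading each partition's rows across several reducers with DISTRIBUTE BY; a sketch (the `FLOOR(RAND() * 6)` expression is a hypothetical placeholder, not part of my real query):

```sql
-- Hypothetical sketch: force ~6 reducers to each write one file per partition.
-- FLOOR(RAND() * 6) is a placeholder distribution key; note RAND() has caveats
-- if a failed task is retried (rows may be redistributed differently on retry).
INSERT OVERWRITE TABLE CS.STAGING_AVRO
  PARTITION (CPD_DT='${hiveconf:CPD_DT}', CUT_NO='${hiveconf:CUT_NO}')
SELECT field1, field2, ...
FROM ...
DISTRIBUTE BY FLOOR(RAND() * 6);
```

I am not sure whether this interacts cleanly with the DISTINCT in my actual query, which needs its own shuffle.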
I tried the following settings, but none of them worked:
SET mapred.max.split.size=1000000;
SET hive.exec.reducers.bytes.per.reducer=1000000;
SET hive.merge.size.per.task=1000000;
SET hive.merge.smallfiles.avgsize=1000000;
SET hive.exec.compress.output=true;
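For what it's worth, 1000000 bytes is only about 1 MB, not 500 MB; for a 500 MB target the byte values would be 500 * 1024 * 1024 = 524288000. Also, since the execution engine is Tez, I believe the merge switch that applies is `hive.merge.tezfiles` rather than `hive.merge.mapredfiles` (this is my understanding, not something I have confirmed):

```sql
-- 500 MB in bytes: 500 * 1024 * 1024 = 524288000
SET hive.merge.tezfiles=true;                        -- Tez counterpart of hive.merge.mapredfiles (my understanding)
SET hive.exec.reducers.bytes.per.reducer=524288000;  -- target input bytes per reducer
SET hive.merge.size.per.task=524288000;              -- target size of merged output files
SET hive.merge.smallfiles.avgsize=524288000;         -- merge triggers when average file size is below this
SET mapred.max.split.size=524288000;                 -- maximum input split size
```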
What do I need to change to get output files of roughly 500 MB per partition? Any help is appreciated.