I am executing a Hive INSERT OVERWRITE query:
SET hive.merge.mapredfiles=false;
SET mapred.reduce.tasks=${hiveconf:NUM_REDUCERS};
SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;
SET hive.execution.engine=tez;
SET hive.tez.container.size=6144;
INSERT OVERWRITE TABLE CS.STAGING_AVRO PARTITION (CPD_DT='${hiveconf:CPD_DT}',CUT_NO='${hiveconf:CUT_NO}')
SELECT DISTINCT CS_TRAN.field1, CS_TRAN.field2, ...
FROM ( SELECT A, B, C, ... FROM NMS1.TBL1 WHERE ... ) CS_TRAN
JOIN NMS2.TBL2 WHERE ...
This INSERT OVERWRITE query joins data from two tables and writes the result into the CS.STAGING_AVRO table, partitioned by date and cut number. The issue I am facing is that it creates one big file per partition.
For example:
- /data/cpd_dt=20230220/cut_no=00/000000_0 --> 3 GB
- /data/cpd_dt=20230220/cut_no=01/000000_0 --> 3 GB
- /data/cpd_dt=20230220/cut_no=02/000000_0 --> 3 GB
Every partition gets a single 3 GB file. I want to split that one large file into multiple files of about 500 MB each, but I have not been able to achieve this.
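For scale: 3 GB at roughly 500 MB per file means about 6 files per partition, so my understanding is that ~6 reducers would each need to write into every partition. One approach I have seen suggested is spreading each partition's rows across several reducers with DISTRIBUTE BY; a sketch (the `FLOOR(RAND() * 6)` expression is a hypothetical placeholder, not part of my real query):

```sql
-- Hypothetical sketch: force ~6 reducers to each write one file per partition.
-- FLOOR(RAND() * 6) is a placeholder distribution key; note RAND() has caveats
-- if a failed task is retried (rows may be redistributed differently on retry).
INSERT OVERWRITE TABLE CS.STAGING_AVRO
  PARTITION (CPD_DT='${hiveconf:CPD_DT}', CUT_NO='${hiveconf:CUT_NO}')
SELECT field1, field2, ...
FROM ...
DISTRIBUTE BY FLOOR(RAND() * 6);
```

I am not sure whether this interacts cleanly with the DISTINCT in my actual query, which needs its own shuffle.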
I tried the following settings, but none of them worked:
SET mapred.max.split.size=1000000;
SET hive.exec.reducers.bytes.per.reducer=1000000;
SET hive.merge.size.per.task=1000000;
SET hive.merge.smallfiles.avgsize=1000000;
SET hive.exec.compress.output=true;
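For what it's worth, 1000000 bytes is only about 1 MB, not 500 MB; for a 500 MB target the byte values would be 500 * 1024 * 1024 = 524288000. Also, since the execution engine is Tez, I believe the merge switch that applies is `hive.merge.tezfiles` rather than `hive.merge.mapredfiles` (this is my understanding, not something I have confirmed):

```sql
-- 500 MB in bytes: 500 * 1024 * 1024 = 524288000
SET hive.merge.tezfiles=true;                        -- Tez counterpart of hive.merge.mapredfiles (my understanding)
SET hive.exec.reducers.bytes.per.reducer=524288000;  -- target input bytes per reducer
SET hive.merge.size.per.task=524288000;              -- target size of merged output files
SET hive.merge.smallfiles.avgsize=524288000;         -- merge triggers when average file size is below this
SET mapred.max.split.size=524288000;                 -- maximum input split size
```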
What do I need to change to get output files of roughly 500 MB per partition? Any help is appreciated.