
My input consists of a large number of small ORC files which I would like to merge at the end of each day, and I would like to split the data into 100 MB blocks.

Both my input and output are on S3, and the environment I am using is EMR.

The Hive parameters I am setting:

set hive.msck.path.validation=ignore;
set hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;

set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;

set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 8;

SET hive.exec.stagingdir=/tmp/hive/;
SET hive.exec.scratchdir=/tmp/hive/;

set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=104857600;
set mapred.reduce.tasks=10;

My Insert Statement:

insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by event_type;

Now the problem is: I have around 80 input files totalling about 500 MB, and after this insert statement I was expecting 4 files in S3, but all of them are getting merged into a single file, which is not the desired output.

Can someone please let me know what's going wrong?

Rajiv
  • `mapred` properties are all deprecated – OneCricketeer Oct 28 '17 at 20:41
  • @cricket_007 Oh ok, thanks, will check. I have just now figured out the answer: we can use clustering (bucketing) to further split a partition into multiple pieces (see the sketch after these comments). I am exploring Hive here because my Spark output has too many small files, and if I expose those smaller files via Presto to end users, querying over them will be slower https://community.hortonworks.com/content/supportkb/49637/hive-bucketing-and-partitioning.html – Rajiv Oct 28 '17 at 21:16
  • You should use `coalesce` or `repartition` in Spark to fix your small file problem – OneCricketeer Oct 29 '17 at 07:42
  • Yeah, I have tried coalesce, but it takes too much time given the volume of input data. So I am avoiding coalesce at the Spark layer; once the events are partitioned, I will selectively apply coalesce for a specific set of events, either in Hive or in Spark. Right now I am trying the same using Hive – Rajiv Oct 29 '17 at 08:43
  • Can't you extract the partitions at the Spark layer? I don't see a need to write to disk only to read it back, other than having some data persistence – OneCricketeer Oct 29 '17 at 15:07
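
A minimal sketch of the bucketing (clustering) approach mentioned in the comments above, assuming a hypothetical target table with a single data column event_payload; the bucket count and column choice are illustrative, not taken from the question:

-- Hypothetical bucketed target table: 4 buckets per partition,
-- so each partition directory should end up with roughly 4 files.
CREATE TABLE dev.orc_convert_zzz_18_bucketed (
  event_payload STRING
)
PARTITIONED BY (event_type STRING)
CLUSTERED BY (event_payload) INTO 4 BUCKETS
STORED AS ORC;

SET hive.exec.dynamic.partition.mode = nonstrict;
-- On Hive versions before 2.0 you may also need:
-- SET hive.enforce.bucketing = true;

-- The partition column must come last in the select list for dynamic partitioning.
INSERT INTO TABLE dev.orc_convert_zzz_18_bucketed PARTITION(event_type)
SELECT event_payload, event_type
FROM dev.events_part_input_18
WHERE event_type = 'ScreenLoad';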

1 Answer


You are using two different concepts to control the output files:

  • partition: sets the directories
  • distribute by: sets the files in each directory

If you just want to have 4 files in each directory, you can distribute by a random number, for example:

insert into table dev.orc_convert_zzz_18 partition(event_type) 
select * from dev.events_part_input_18 
where event_type = 'ScreenLoad' distribute by Cast((FLOOR(RAND()*4.0)) as INT);

But I would recommend distributing by some column in your data that you might query by; it can improve your query times.
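
For example, a rough sketch distributing by the os column that Rajiv mentions in the comments below (this assumes the source table actually has such a column); rows with the same os value always land on the same reducer, so the resulting files line up with a column you might later filter on:

insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by os;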

You can read more about it here.

lev
  • Hi @lev, I tried this, but I am getting 30 output files; any idea how I can control that? I tried setting reducers to 10, thinking that it would result in 10 files, but I still end up getting 30 – Rajiv Oct 29 '17 at 03:48
  • You are right, `RAND` returns a double between 0 and 1. I have fixed the answer – lev Oct 29 '17 at 03:56
  • I tried this as well, but it didn't work; please find the screenshot here, not sure what mistake I am making: https://ibb.co/eFqorR – Rajiv Oct 29 '17 at 04:39
  • I also tried Cast((FLOOR(RAND()*4.0)) as INT) % 4, but no luck – Rajiv Oct 29 '17 at 04:46
  • I tried with another field named os which has 15 distinct values, so ideally 15 files should have been created, but it's resulting in 30 files there as well, screenshot: https://ibb.co/nicOQm – Rajiv Oct 29 '17 at 04:53