
My input consists of a large number of small ORC files which I would like to merge at the end of each day, and I would like to split the data into 100 MB blocks.

Both my input and output are on S3, and the environment I am using is EMR.

The Hive parameters I am setting:

set hive.msck.path.validation=ignore;
set hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.mapred.mode = nonstrict;

set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;

set hive.exec.parallel = true;
set hive.exec.parallel.thread.number = 8;

SET hive.exec.stagingdir=/tmp/hive/;
SET hive.exec.scratchdir=/tmp/hive/;

set mapred.max.split.size=68157440;
set mapred.min.split.size=68157440;
set hive.merge.smallfiles.avgsize=104857600;
set hive.merge.size.per.task=104857600;
set mapred.reduce.tasks=10;

My Insert Statement:

insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by event_type;

Now the problem is: I have around 80 input files totalling about 500 MB, and after this insert statement I was expecting 4 files in S3, but all of them are getting merged into a single file, which is not the desired output.

Can someone please let me know what's going wrong?

Rajiv
  • `mapred` properties are all deprecated – OneCricketeer Oct 28 '17 at 20:41
  • @cricket_007 Oh ok, thanks, will check. I have just now figured out the answer: we can use clustering (bucketing) to further split a partition into multiple pieces (see the sketch after these comments). I am exploring Hive here because my Spark output has too many small files, and if I expose those smaller files via Presto to end users, querying over them will be slower https://community.hortonworks.com/content/supportkb/49637/hive-bucketing-and-partitioning.html – Rajiv Oct 28 '17 at 21:16
  • You should use `coalesce` or `repartition` in Spark to fix your small file problem – OneCricketeer Oct 29 '17 at 07:42
  • Yeah, I have tried coalesce, but it takes too much time given the volume of input data. So I am avoiding coalesce at the Spark layer; once the events are partitioned, I will selectively apply coalesce for a specific set of events, either in Hive or in Spark. Right now I am trying the same using Hive – Rajiv Oct 29 '17 at 08:43
  • Can't you extract the partitions at the Spark layer? I don't see a need to write to disk only to read it back, other than having some data persistence – OneCricketeer Oct 29 '17 at 15:07
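
A minimal sketch of the bucketing (clustering) approach mentioned in the comments above, assuming a hypothetical target table with a single data column event_payload; the bucket count and column choice are illustrative, not taken from the question:

-- Hypothetical bucketed target table: 4 buckets per partition,
-- so each partition directory should end up with roughly 4 files.
CREATE TABLE dev.orc_convert_zzz_18_bucketed (
  event_payload STRING
)
PARTITIONED BY (event_type STRING)
CLUSTERED BY (event_payload) INTO 4 BUCKETS
STORED AS ORC;

SET hive.exec.dynamic.partition.mode = nonstrict;
-- On Hive versions before 2.0 you may also need:
-- SET hive.enforce.bucketing = true;

-- The partition column must come last in the select list for dynamic partitioning.
INSERT INTO TABLE dev.orc_convert_zzz_18_bucketed PARTITION(event_type)
SELECT event_payload, event_type
FROM dev.events_part_input_18
WHERE event_type = 'ScreenLoad';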

1 Answer


You are using two different concepts to control the output files:

  • partition: sets the directories
  • distribute by: sets the files in each directory

If you just want to have 4 files in each directory, you can distribute by a random number, for example:

insert into table dev.orc_convert_zzz_18 partition(event_type) 
select * from dev.events_part_input_18 
where event_type = 'ScreenLoad' distribute by Cast((FLOOR(RAND()*4.0)) as INT);

But I would recommend distributing by some column in your data that you might query by; it can improve your query times.
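
For example, a rough sketch distributing by the os column that Rajiv mentions in the comments below (this assumes the source table actually has such a column); rows with the same os value always land on the same reducer, so the resulting files line up with a column you might later filter on:

insert into table dev.orc_convert_zzz_18 partition(event_type)
select * from dev.events_part_input_18
where event_type = 'ScreenLoad' distribute by os;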

You can read more about it here.

lev
  • Hi @lev, I tried this, but I am getting 30 output files; any idea how I can control that? I tried setting reducers to 10, thinking that it would result in 10 files, but I still end up getting 30 – Rajiv Oct 29 '17 at 03:48
  • You are right, `RAND` returns a double between 0 and 1. I have fixed the answer – lev Oct 29 '17 at 03:56
  • I tried this as well, but it didn't work; please find the screenshot here, not sure what mistake I am making: https://ibb.co/eFqorR – Rajiv Oct 29 '17 at 04:39
  • I also tried Cast((FLOOR(RAND()*4.0)) as INT) % 4, but no luck – Rajiv Oct 29 '17 at 04:46
  • I tried with another field named os which has 15 distinct values, so ideally 15 files should have been created, but it's resulting in 30 files there as well, screenshot: https://ibb.co/nicOQm – Rajiv Oct 29 '17 at 04:53