Hive query taking a lot of time just to launch map-reduce jobs

Question

We are using Hive for ad hoc querying and have a Hive table which is partitioned on two fields (date, id).

Each date has around 1400 ids, so roughly that many partitions are added every day. The actual data resides in S3. The issue we are facing is that if we run a select count(*) over a month of data, it takes a very long time (approx. 1 hr 52 min) just to launch the map-reduce job.

When I ran the query in Hive verbose mode, I could see that it spends this time deciding how many mappers to spawn (calculating splits). Is there any way to reduce this lag before the map-reduce job launches?

This is one of the log messages that is being logged during this lag time:

13/11/19 07:11:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/19 07:11:06 WARN httpclient.RestS3Service: Response '/Analyze%2F2013%2F10%2F03%2F465' - Unexpected response code 404, expected 200

Look, a JIRA! https://issues.apache.org/jira/browse/HIVE-5851 – Remus Rusanu Nov 19 '13 at 10:56
We went ahead and changed the Hive source code to get around this listing step, and it gave a decent improvement in Hive's startup time for queries. – Sreenath Kamath Oct 07 '16 at 07:26

score 1 · Accepted Answer · answered Nov 20 '13 at 22:41

This is probably because with an over-partitioned table the query planning phase takes a long time. Worse, the query planning phase itself might take longer than the query execution phase.

One way to overcome this problem would be to tune up your metastore. But the better solution would be to devise an efficient schema and get rid of unnecessary partitions. Trust me, you really don't want too many small partitions.
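
For scale: at roughly 1400 ids per day, a one-month query touches about 1400 × 30 ≈ 42,000 partitions, each of which has to be listed before splits can be computed. A minimal sketch of a coarser layout follows, assuming the id can live as an ordinary column instead of a partition key. The table name, column names and types here are hypothetical, not taken from the question.

```sql
-- Hypothetical layout: one partition per day instead of one per (date, id).
-- analyze_events_by_day, payload and dt are illustrative names only.
CREATE TABLE analyze_events_by_day (
  id      INT,     -- formerly a partition key, now a regular column
  payload STRING   -- placeholder for the real data columns
)
PARTITIONED BY (dt STRING);

-- A month-long count now resolves about 30 partitions rather than ~42,000.
SELECT COUNT(*)
FROM analyze_events_by_day
WHERE dt >= '2013-10-01' AND dt < '2013-11-01';
```

The trade-off is that a filter on a single id can no longer be answered by partition pruning alone; it has to scan that day's data.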

As an alternative you could also try setting hive.input.format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat before you issue your query.
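
As a rough illustration, the property can be set per session from the Hive CLI before running the query. The table name and the dt partition column below are hypothetical stand-ins for the real ones.

```sql
-- Ask Hive to combine many small input files into fewer splits.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Hypothetical query; substitute the actual table and partition column.
SELECT COUNT(*)
FROM analyze_events
WHERE dt >= '2013-10-01' AND dt < '2013-11-01';
```

Running SET hive.input.format; with no value prints the current setting, which is a quick way to check whether the property is already in effect.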

HTH

answered Nov 20 '13 at 22:41 by Tariq

I have already set the input format to org.apache.hadoop.hive.ql.io.CombineHiveInputFormat:
hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
"The default input format. Set this to HiveInputFormat if you encounter problems with CombineHiveInputFormat." – Sreenath Kamath Nov 21 '13 at 05:56
During my initial investigation I created a separate metastore just for this table, to make sure the metastore was not the bottleneck, and it didn't give me any improvement. – Sreenath Kamath Nov 21 '13 at 06:05
Did decreasing the number of partitions make any difference? – Tariq Nov 21 '13 at 09:46
Yes, decreasing the number of partitions helped, but that won't be a permanent fix for my problem: the ids for a particular date may increase, and in that case I can't control the execution time. – Sreenath Kamath Nov 21 '13 at 12:38
