We are using Hive for Ad-hoc querying and have a Hive table which is partitioned on two fields (date,id)
.
Now for each date there are around 1400 ids so on a single day around that many partitions are added. The actual data is residing in s3. Now the issue we are facing is suppose we do a select count(*)
for a month from the table then it takes quite a long amount of time (approx : 1hrs 52 min) just to launch the map reduce job.
When I ran the query in Hive verbose mode I can see that its spending this time actually deciding how many number of mappers to spawn (calculating splits). Is there any means by which I can reduce this lag time for the launch of map-reduce job?
This is one of the log messages that is being logged during this lag time:
13/11/19 07:11:06 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/19 07:11:06 WARN httpclient.RestS3Service: Response '/Analyze%2F2013%2F10%2F03%2F465' - Unexpected response code 404, expected 200