I'm trying to run a simple query on a table with one partition that holds around 200-300k records. Each record is a small file of about 120 bytes.

I'm using a custom INPUTFORMAT that reads each file's contents and then queries another S3 file to fetch the actual data. Each file corresponds to one record.

The query is taking around 6 hours to complete. I am using a cluster of 10 machines of type m2.4xlarge on EMR.

Looking into the logs, there is a one-hour delay between starting the job and starting the MapReduce tasks. Also, the number of mappers is shown as only 1:

 Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0

Is there anything I'm missing? It looks like there is no parallel execution at all. I tried setting the following properties, but saw no improvement:

mapreduce.job.counters.limit 1000
mapred.tasktracker.tasks.maximum 1000
mapred.tasktracker.map.tasks.maximum 100
mapred.tasktracker.reduce.tasks.maximum 95
mapred.map.tasks 100
mapred.child.java.opts -Xmx15048m
namenode-heap-size 15048

Below are the tables and query details.

CREATE EXTERNAL TABLE IF NOT EXISTS sample (
    x string,
    y date
)
PARTITIONED BY (date STRING)
ROW FORMAT SERDE "com.gts.hive.analytics.store.serde.CustomSerDe"
STORED AS INPUTFORMAT 'com.gts.hive.analytics.store.formats.mapred.GZipJsonFileInputFormat2'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3n://xyzlocation/';

ALTER TABLE sample ADD IF NOT EXISTS PARTITION(date='2013-12-31-07');

select x from sample;
James Z
Ravi
  • How many mappers are in the cluster? Where did you set all the properties listed above (in the *-site.xml files)? You have not provided enough information to troubleshoot the one-hour delay: for example, are other jobs running, and what does the JobTracker say about running jobs and available mapper/reducer slots? – WestCoastProjects Jan 02 '14 at 14:44
  • I am passing all these properties via the API when launching the steps during cluster setup. With or without these settings, there is no improvement. There are no other jobs running on the cluster. The one-hour delay is proportional to the number of records in the partition: with a sample partition of 100 records there is no delay, but as the record count grows, the delay grows with it. The JobTracker shows 800 map/reduce slots available, but only 1 slot is in use during the entire execution of the job. – Ravi Jan 02 '14 at 18:22
  • How many map/reduce slots does the JT say the cluster possesses? – WestCoastProjects Jan 02 '14 at 18:26
  • I have 1 master node, 8 core nodes. The JT shows 800 map slots and 800 reduce slots as open. – Ravi Jan 02 '14 at 18:29
  • You have hit most of the items already. The only other things I can think of: set mapreduce.input.fileinputformat.split.maxsize=8000000 and check your InputFormat. – WestCoastProjects Jan 02 '14 at 19:17
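
For reference, the split-size suggestion in the last comment could be tried per session from the Hive CLI rather than through cluster-level configuration. This is only a sketch: the value 8000000 comes from the comment, `mapred.max.split.size` is the older Hadoop 1 name for the same property, and whether either setting takes effect depends on how the custom InputFormat computes its splits, which is not shown in the question.

```sql
-- Cap the size of a single input split (in bytes), as suggested in the comment.
-- Newer property name:
SET mapreduce.input.fileinputformat.split.maxsize=8000000;
-- Older (Hadoop 1) name for the same limit; assumption: the cluster may need this one instead.
SET mapred.max.split.size=8000000;

-- Print the input format Hive wraps around the table's own InputFormat.
-- If this is a combining input format, many small files may be merged into one split.
SET hive.input.format;

-- Re-run the original query and check the reported number of mappers.
SELECT x FROM sample;
```

If the reported number of mappers is still 1, the next place to look is the `getSplits` implementation of the custom InputFormat.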

2 Answers

gzip is not splittable, so Hadoop cannot divide a gzipped file among several mappers; each gzipped file is read by a single mapper. There are splittable compression formats, such as bzip2, that you can use instead. More info is available at http://comphadoop.weebly.com/

Carter Shanklin

You can try simply adding CLUSTER BY rand() at the end of your query. I don't quite understand why, but it seems that JOIN, GROUP BY, CLUSTER BY, etc. enable some sort of shuffle for what would otherwise be a map-only job. Without one of them, the number of reducers is automagically forced to 1.
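
Applied to the query from the question, the suggestion would look roughly like the sketch below. The `DISTRIBUTE BY` variant is an added alternative, not part of the original suggestion; it spreads rows across reducers without also sorting them.

```sql
-- Original query with the suggested clause appended.
-- CLUSTER BY adds a reduce stage: rows are shuffled (and sorted) by a random key.
SELECT x FROM sample CLUSTER BY rand();

-- Alternative (assumption, not from the answer): shuffle only, without the sort.
SELECT x FROM sample DISTRIBUTE BY rand();
```

Note that either clause only adds a shuffle and a reduce stage. The number of map tasks is still determined by the input splits, so this may not change the "number of mappers: 1" line on its own.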

Zhang Fan