I'm trying to run a simple query on a table with a single partition containing around 200-300k records. Each record is a small file of 120 bytes.
I'm using a custom INPUTFORMAT that reads each file's contents and then queries another S3 file to fetch the actual data. Each file corresponds to one record.
The query takes around 6 hours to complete. I am using a cluster of 10 m2.4xlarge machines on EMR.
Looking into the logs, there is a one-hour delay between the job starting and the map-reduce tasks starting. Also, the number of mappers is shown as only 1:
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
Is there anything I'm missing? It looks like there is no parallel execution at all. I tried setting the following properties, but saw no improvement:
mapreduce.job.counters.limit 1000
mapred.tasktracker.tasks.maximum 1000
mapred.tasktracker.map.tasks.maximum 100
mapred.tasktracker.reduce.tasks.maximum 95
mapred.map.tasks 100
mapred.child.java.opts -Xmx15048m
namenode-heap-size 15048
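For reference, a minimal sketch of how the two job-level properties from this list can be applied per session in the Hive CLI. This assumes they are set with SET statements rather than in the cluster configuration; the mapred.tasktracker.* limits and the heap size are daemon-level settings and are not covered here. Note that mapred.map.tasks is only a hint to the input format, not a guarantee.

```sql
-- Assumption: run in the same Hive session, before the query.
-- Hint for the number of map tasks (the input format may ignore it).
SET mapred.map.tasks=100;
-- JVM options for each child task.
SET mapred.child.java.opts=-Xmx15048m;

-- Print the values the session actually sees.
SET mapred.map.tasks;
SET mapred.child.java.opts;
```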
Below are the tables and query details.
CREATE EXTERNAL TABLE IF NOT EXISTS sample(
x string,
y date
)
PARTITIONED BY (date STRING)
ROW FORMAT SERDE "com.gts.hive.analytics.store.serde.CustomSerDe"
STORED AS INPUTFORMAT 'com.gts.hive.analytics.store.formats.mapred.GZipJsonFileInputFormat2'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3n://xyzlocation/';
ALTER TABLE sample ADD IF NOT EXISTS PARTITION(date='2013-12-31-07');
select x from sample;