
I've set up an EMR job through Data Pipeline in AWS. This job is to transfer CSV data from S3 to DynamoDB.

My data size is 400 MB. I set mapred.max.split.size = 134217728 (i.e. 128 MB). With that, I can see in the monitoring graph that there are 3 map tasks, but they never run in parallel, so it takes 43 minutes to transfer 400 MB. The stderr log for the task always shows the map tasks running sequentially.

I tried 2 core nodes of various instance types (m1.small, c3.xlarge, c3.2xlarge), but to no avail.

Is there any other setting or configuration change needed to make these map tasks run in parallel?

Mouli

2 Answers


Check if this helps you: The mapper daemons that Hadoop launches to process your requests to export and query data stored in DynamoDB are capped at a maximum read rate of 1 MiB per second to limit the read capacity used. If you have additional provisioned throughput available on DynamoDB, you can improve the performance of Hive export and query operations by increasing the number of mapper daemons. To do this, you can either increase the number of EC2 instances in your cluster or increase the number of mapper daemons running on each EC2 instance.
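As a rough sanity check (a sketch, not an AWS formula; the 1 MiB/s-per-mapper cap comes from the passage above, and the slot counts are illustrative), you can estimate a cluster's best-case aggregate scan rate:

```shell
# Back-of-envelope estimate of aggregate DynamoDB read rate.
# Assumption: each mapper is capped at ~1 MiB/s, as described above.
instances=2             # core nodes in the cluster (illustrative)
mappers_per_instance=2  # map slots per node; varies by instance type
input_mib=400           # input size in MiB

rate_mib_s=$((instances * mappers_per_instance))  # MiB/s with full parallelism
best_case_s=$((input_mib / rate_mib_s))           # seconds to scan the input
echo "aggregate rate: ${rate_mib_s} MiB/s; best-case scan time: ${best_case_s}s"
```

With the three 128 MB splits from the question actually running in parallel, even a small cluster would finish in minutes rather than 43, so a sequential schedule points at slot/node capacity rather than DynamoDB throughput.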

You can increase the number of EC2 instances in a cluster by stopping the current cluster and re-launching it with a larger number of EC2 instances. You specify the number of EC2 instances in the Configure EC2 Instances dialog box if you're launching the cluster from the Amazon EMR console, or with the --num-instances option if you're launching the cluster from the CLI.
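For the CLI route, a re-launch might look like the following (a hedged sketch using the legacy elastic-mapreduce CLI of that era; only --num-instances comes from the text above, and all values are illustrative, not a verified invocation):

```shell
# Hypothetical re-launch with more core nodes (illustrative values,
# legacy elastic-mapreduce CLI assumed).
elastic-mapreduce --create --alive \
  --name "csv-to-dynamodb" \
  --num-instances 4
```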

The number of map tasks run on an instance depends on the EC2 instance type. For more information about the supported EC2 instance types and the number of mappers each one provides, go to Hadoop Configuration Reference in the Amazon EMR Developer Guide. There, you will find a "Task Configuration" section for each of the supported configurations.

Another way to increase the number of mapper daemons is to change the mapred.tasktracker.map.tasks.maximum configuration parameter of Hadoop to a higher value. This has the advantage of giving you more mappers without increasing either the number or the size of EC2 instances, which saves you money. A disadvantage is that setting this value too high can cause the EC2 instances in your cluster to run out of memory. To set mapred.tasktracker.map.tasks.maximum, launch the cluster and specify the Configure Hadoop bootstrap action, passing in a value for mapred.tasktracker.map.tasks.maximum as one of the arguments of the bootstrap action. This is shown in the following example.

--bootstrap-action s3n://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args -s,mapred.tasktracker.map.tasks.maximum=10

For more information about bootstrap actions, see Using Custom Bootstrap Actions in the Amazon EMR Developer Guide.

Mayank Agarwal
  • Yep, tried the mapred.tasktracker.map.tasks.maximum and also launched with a larger number of instances (e.g. 4 core and 4 task nodes of c3.xlarge). No use. On the DynamoDB side, I have enough throughput; in fact, the job consumes less than half of it. I also set dynamodb.throughput.write.percent=1.0 so all throughput can be used, but that doesn't work either. – Mouli Jun 10 '14 at 16:49
  • Have you checked whether all of them are working properly and all slaves are pointing to the same master? – Mayank Agarwal Jun 10 '14 at 16:52
  • In EC2 monitoring, I could see all instances constantly using some CPU, so all of them are working properly. Since there's only one master, I'm not sure how they could point to anything else. And my file is not in Gzip format. I tried that too, but the Hive activity I'm using doesn't recognize Gzip despite setting mapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec – Mouli Jun 10 '14 at 16:57
  • Did you set this parameter: mapred.map.tasks.speculative.execution? – Mayank Agarwal Jun 10 '14 at 17:04

Mayank was correct, I think. I ran into a similar issue: 1 map task was RUNNING while the other 9 were PENDING. After I increased the number of core nodes, all map tasks showed as RUNNING.

Pay attention to the DynamoDB throughput (read/write) and the capacity of your cluster. In the case of m1.medium, it's 2 mappers per instance by default.

oVo