
I'm running a Python streaming job on Amazon's Elastic MapReduce that needs to output multiple files from the reducer. The descriptions I've found on the web of how to do this are all old, so they reference the deprecated property mapred.work.output.dir. When I try to create files in the directory pointed to by the modern equivalent, mapreduce.task.output.dir (i.e. mapreduce_task_output_dir for streaming jobs), I get a "No such file or directory" error:

OSError: [Errno 2] No such file or directory: 's3://mybucket-data/output/encounter/_temporary/1/_temporary/attempt_1416321762038_0001_r_000003_0'

The documentation for FileOutputFormat.getWorkOutputPath() seems to indicate that this should still work.

I suspect the issue is that this path points to S3, but I don't know whether I should be using a different (i.e. local) directory (and if so, which property I need), figuring out how to get Python to write to S3, or something else.
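For reference, here is a minimal sketch of how a streaming reducer could inspect that property before trying to write to it. The helper name `describe_output_dir` is hypothetical; the only things taken from the question are the environment variable name and the example path. It relies on the fact that Python's built-in `open()` and `os` functions interpret any string as a local filesystem path, so a value with an `s3` scheme cannot be used with them directly.

```python
import os
from urllib.parse import urlparse


def describe_output_dir(environ=os.environ):
    """Return (path, scheme) for the task output directory, or (None, None).

    Hypothetical helper: streaming jobs see job properties as environment
    variables with dots replaced by underscores, as described in the question.
    """
    path = environ.get("mapreduce_task_output_dir")
    if path is None:
        return None, None
    # A plain path such as /tmp/out has no scheme; treat it as a local file path.
    scheme = urlparse(path).scheme or "file"
    return path, scheme


if __name__ == "__main__":
    path, scheme = describe_output_dir()
    if path is None:
        print("mapreduce_task_output_dir is not set")
    elif scheme == "file":
        print("local directory, safe to use with open():", path)
    else:
        print("non-local URI (scheme %r), open() cannot write here: %s" % (scheme, path))
```

Running this inside the reducer only reports what kind of location the property points to; it does not decide which directory should be used instead.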

Tom Morris

1 Answer


When I've hit this error in the past, it was because the IAM role for my EMR cluster wasn't defined properly.

Does the IAM role for your EMR cluster include "s3:*" in its Actions?

For example:

    {
      "Statement": [
        {
          "Action": [
            "cloudwatch:*",
            "ec2:Describe*",
            "elasticmapreduce:Describe*",
            "s3:*",
            "sdb:*",
            "sns:*",
            "sqs:*"
          ],
          "Effect": "Allow",
          "Resource": "*"
        }
      ]
    }

IAM roles give AWS services permission to act on other AWS resources. In your case, you may need to grant your EMR cluster permission to write to S3; otherwise you may get an error saying the S3 bucket was not found.
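As a quick local sanity check, the sketch below tests whether a policy document like the one above contains an Allow statement matching a given action. The function `allows_action` is a hypothetical helper, not part of any AWS SDK, and it is deliberately simplistic: it ignores Deny statements, Resource restrictions, and Conditions, so treat a positive result as a hint rather than proof of access.

```python
import json
from fnmatch import fnmatchcase

# The example policy from this answer, embedded so the sketch is self-contained.
POLICY_JSON = """
{
  "Statement": [
    {
      "Action": [
        "cloudwatch:*",
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "s3:*",
        "sdb:*",
        "sns:*",
        "sqs:*"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
"""


def allows_action(policy, action):
    """Return True if any Allow statement has an Action pattern matching `action`.

    Simplified assumption: only Effect and Action are inspected.
    """
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be given as an object
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        patterns = statement.get("Action", [])
        if isinstance(patterns, str):  # a single action may be given as a string
            patterns = [patterns]
        # IAM action names are case-insensitive, so compare in lowercase.
        if any(fnmatchcase(action.lower(), p.lower()) for p in patterns):
            return True
    return False


policy = json.loads(POLICY_JSON)
print(allows_action(policy, "s3:PutObject"))
print(allows_action(policy, "ec2:RunInstances"))
```

With the policy above, the first check matches the "s3:*" pattern, while the second does not match because only "ec2:Describe*" is allowed for EC2.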

Amazon provides a quick tutorial on the basics of setting up an EMR IAM Role: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html

bird_spock