
I'm trying to add a job via the AWS SDK for PHP. I'm able to successfully start a cluster and start a new job flow via the API, but I'm getting an error while trying to create a Hadoop Streaming step.

Here is my code:

// add some jobflow steps
$response = $emr->add_job_flow_steps($JobFlowId, array(
    new CFStepConfig(array(
        'Name' => 'MapReduce Step 1. Test',
        'ActionOnFailure' => 'TERMINATE_JOB_FLOW',
        'HadoopJarStep' => array(
            'Jar' => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
            // ERROR IS HERE!!!! How can we pass the parameters?
            'Args' => array(
                '-input s3://logs-input/appserver1 -output s3://logs-input/job123/ -mapper s3://myscripts/mapper-apache.php -reducer s3://myscripts/reducer.php',
            ),
        ),
    )),
));

I'm getting an error like: Invalid streaming parameter '-input s3://.... -output s3://..... -mapper s3://....../mapper.php -reducer s3://...../reducer.php'

So it is not clear how I can pass the arguments to the Hadoop Streaming JAR.

The official AWS SDK for PHP documentation doesn't provide any examples or guidance for this.

Possibly related unanswered thread:

Pass parameters to hive script using aws php sdk


2 Answers


This worked for me:

'Args' => array(
    '-input',   's3://mybucket/in/',
    '-output',  's3://mybucket/oo/',
    '-mapper',  's3://mybucket/c/mapperT1.php',
    '-reducer', 's3://mybucket/c/reducerT1.php',
)
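
For completeness, here is a sketch of how that Args array would slot into the full add_job_flow_steps call from the question. It is untested, and the bucket names and script paths are just placeholders:

$response = $emr->add_job_flow_steps($JobFlowId, array(
    new CFStepConfig(array(
        'Name' => 'MapReduce Step 1. Test',
        'ActionOnFailure' => 'TERMINATE_JOB_FLOW',
        'HadoopJarStep' => array(
            'Jar' => '/home/hadoop/contrib/streaming/hadoop-streaming.jar',
            // Each streaming option and its value is a separate array element,
            // rather than one long space-separated string.
            'Args' => array(
                '-input',   's3://mybucket/in/',
                '-output',  's3://mybucket/oo/',
                '-mapper',  's3://mybucket/c/mapperT1.php',
                '-reducer', 's3://mybucket/c/reducerT1.php',
            ),
        ),
    )),
));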

I haven't performed these steps with the AWS SDK for PHP yet, but from other environments I'd figure that the way you specify the Amazon S3 locations might not be correct; I think they need to be as follows for your input and output parameters:

  • s3n://logs-input/appserver1
  • s3n://logs-input/job123/

Please note the usage of the s3n: vs. s3: URI scheme, which might be a requirement for Amazon EMR as per the respective FAQ How does Amazon Elastic MapReduce use Amazon EC2 and Amazon S3?:

Customers upload their input data and a data processing application into Amazon S3. Amazon Elastic MapReduce then launches a number of Amazon EC2 instances as specified by the customer. The service begins the job flow execution while pulling the input data from Amazon S3 using S3N protocol into the launched Amazon EC2 instances. Once the job flow is finished, Amazon Elastic MapReduce transfers the output data to Amazon S3, where customers can then retrieve it or use as input in another job flow. [emphasis mine]
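
If the s3n: scheme is indeed required, the Args array from the other answer would simply use s3n:// URIs for the input and output locations, for example (an untested sketch, reusing the bucket names from the question):

'Args' => array(
    // s3n:// for the input and output locations, as discussed above
    '-input',   's3n://logs-input/appserver1',
    '-output',  's3n://logs-input/job123/',
    '-mapper',  's3://myscripts/mapper-apache.php',
    '-reducer', 's3://myscripts/reducer.php',
),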


Appendix

The difference between the two URI schemes is explained in the Hadoop Wiki, see AmazonS3:

Hadoop provides two filesystems that use S3.

  • S3 Native FileSystem (URI scheme: s3n) A native filesystem for reading and writing regular files on S3. The advantage of this filesystem is that you can access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop. The disadvantage is the 5GB limit on file size imposed by S3. For this reason it is not suitable as a replacement for HDFS (which has support for very large files).
  • S3 Block FileSystem (URI scheme: s3) A block-based filesystem backed by S3. Files are stored as blocks, just like they are in HDFS. This permits efficient implementation of renames. This filesystem requires you to dedicate a bucket for the filesystem - you should not use an existing bucket containing files, or write other files to the same bucket. The files stored by this filesystem can be larger than 5GB, but they are not interoperable with other S3 tools.
  • Thanks, Steffen, for such a reasonable explanation. Will take that into consideration and hope this information can be useful for others. Anyway, I'm still looking for a solution to my problem. Posted a similar issue to the Amazon EMR forum: https://forums.aws.amazon.com/thread.jspa?messageID=333121 Hope finally we will be able to find the answer. – webdevbyjoss Apr 02 '12 at 15:56
  • @webdevbyjoss - I see you already explored quite a few variations; peeking at the SDK source suggests that the 3rd variation should yield an identical result to the 1st used here, whereas the 2nd one using a key shouldn't work at all. What errors are you getting for the 2nd and 3rd variations? – Steffen Opel Apr 02 '12 at 16:08
  • I believe the problem is that the Amazon EMR support in the AWS SDK for PHP lacks documentation almost completely: http://aws.amazon.com/sdkforphp/ The information presented in the SDK shows how to use Hive & Pig but doesn't present a job step that utilises Hadoop Streaming http://docs.amazonwebservices.com/AWSSDKforPHP/latest/#m=AmazonEMR/run_job_flow There are no **working** examples of Elastic MapReduce usage in the SDK itself: https://github.com/amazonwebservices/aws-sdk-for-php/tree/master/_samples So I would like to tell the guys from the Amazon SDK development team about the poor documentation problem for PHP. – webdevbyjoss Apr 02 '12 at 16:25
  • The errors I'm receiving are quite similar to the one posted here, showing that the arguments are passed incorrectly. Looking into the SDK source code right now in order to understand how it works. – webdevbyjoss Apr 02 '12 at 16:29