I am trying to get a simple PigActivity to work in AWS Data Pipeline: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-pigactivity.html#pigactivity
The input and output fields are required for this activity. I have both set to use an S3DataNode, and both of these data nodes have a directoryPath that points to my S3 input and output locations. I originally tried to use filePath but got the following error:
PigActivity requires 'directoryPath' in 'Output' object.
I am using a custom Pig script, also located in S3.
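For reference, my data nodes look roughly like this (the bucket and path names here are placeholders for my actual S3 locations):

{
  "id": "MyInputDataNode",
  "type": "S3DataNode",
  "schedule": { "ref": "MySchedule" },
  "directoryPath": "s3://my-bucket/input"
}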
My question is: how do I reference these input and output paths in my script?
The example given in the reference uses the stage field (which can be enabled or disabled). My understanding is that this is used to convert the data into tables. I don't want to do this, as it also requires that you specify a dataFormat field.
Determines whether staging is enabled and allows your Pig script to have access to the staged-data tables, such as ${INPUT1} and ${OUTPUT1}.
I have disabled staging and I am trying to access the data in my script as follows:
input = LOAD '$Input';
But I get the following error:
IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : Input
I have tried using:
input = LOAD '${Input}';
But I get an error for this too.
There is also the optional scriptVariable field. Do I have to use some sort of mapping here?
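For example, I am wondering whether I need something like the following in my PigActivity (this is just a guess; the ids and paths are placeholders, and I am not sure whether #{input.directoryPath}-style expressions are even allowed in scriptVariable):

{
  "id": "MyPigActivity",
  "type": "PigActivity",
  "runsOn": { "ref": "MyEmrCluster" },
  "scriptUri": "s3://my-bucket/scripts/myscript.pig",
  "stage": "false",
  "input": { "ref": "MyInputDataNode" },
  "output": { "ref": "MyOutputDataNode" },
  "scriptVariable": [
    "INPUT=#{input.directoryPath}",
    "OUTPUT=#{output.directoryPath}"
  ]
}

so that the script could then do something like input = LOAD '$INPUT';. Is that the intended use of scriptVariable?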