
Pig 0.12 introduced streaming Python UDFs, but they are still experimental and only work with Hadoop 1.

http://pig.apache.org/docs/r0.12.1/udf.html#python-udfs

However, the only Amazon-provided AMI that can use Pig 0.12 is AMI 3.1.0, which uses Hadoop 2.4, not Hadoop 1:

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html

So the only AMI that supports the right version of Pig doesn't support the right version of Hadoop. Is there a way to get streaming UDFs working on EMR?

warbaker

1 Answer


You can install your own version of Pig on EMR using a bootstrap action. Create the cluster without Pig preinstalled, on an AMI version that still runs Hadoop 1 (2.4.5?), and then install the version of Pig you want (0.12).
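One way to do the install step is to unpack a Pig tarball and put its `bin` directory first on `PATH`. Here is a minimal sketch of that in Python, not a tested EMR bootstrap action: the function names `extract_pig` and `prepend_to_path` are made up for illustration, it assumes the tarball has a single top-level directory (e.g. `pig-0.12.1/`), and it leaves out downloading the tarball and persisting the environment (for example in a shell profile).

```python
#!/usr/bin/env python
"""Sketch: unpack a Pig tarball and compute an environment that prefers it."""
import os
import tarfile


def extract_pig(tarball_path, dest_dir):
    """Unpack the tarball into dest_dir and return the extracted Pig home.

    Assumes every member lives under one top-level directory.
    """
    with tarfile.open(tarball_path) as tar:
        # First path component of the first member, e.g. "pig-0.12.1".
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(dest_dir)
    return os.path.join(dest_dir, top_level)


def prepend_to_path(pig_home, env):
    """Return a copy of env with PIG_HOME set and pig_home/bin first on PATH."""
    updated = dict(env)
    updated["PIG_HOME"] = pig_home
    updated["PATH"] = os.path.join(pig_home, "bin") + os.pathsep + env.get("PATH", "")
    return updated
```

Something like `prepend_to_path(extract_pig("/tmp/pig.tar.gz", "/home/hadoop"), os.environ)` would then give you the environment to run `pig` with; both paths here are placeholders.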

user1452132
  • I use EMR AMI 3.0.4 with Apache Pig 0.11.1.1 preinstalled; I just extract Apache Pig 0.13.0 from the tarball and update PATH to point to 0.13.0 rather than the preinstalled version. I would assume the same could also be done with older AMIs. – Mikko Kupsu Sep 05 '14 at 18:27
  • This should work too. However, Pig is not part of the AMI itself; it is installed when the cluster is instantiated. So you can change your cluster definition so that Pig is not preinstalled. – user1452132 Sep 06 '14 at 12:24