
I am developing an application that tries to read log files stored in S3 buckets and parse them using Elastic MapReduce. Currently the log file has the following format:

------------------------------- 
COLOR=Black 
Date=1349719200 
PID=23898 
Program=Java 
EOE 
------------------------------- 
COLOR=White 
Date=1349719234 
PID=23828 
Program=Python 
EOE 
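
For reference, here is a minimal sketch (plain Java, standard library only; the `LogRecordParser` class and its method names are made up for illustration) of the parsing a custom loader would have to do for this format: treat `EOE` as an end-of-record marker, skip the dashed separator lines, and split everything else into key/value pairs on the first `=`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LogRecordParser {

    // Parse the log text into one map per record. "EOE" ends a record;
    // dashed separator lines and blank lines are ignored; everything else
    // is a KEY=VALUE pair split on the first '='.
    public static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> records = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (String rawLine : text.split("\\r?\\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("---")) {
                continue; // blank line or dashed separator
            }
            if (line.equals("EOE")) { // end-of-entry marker closes the record
                if (!current.isEmpty()) {
                    records.add(current);
                    current = new LinkedHashMap<>();
                }
                continue;
            }
            int eq = line.indexOf('=');
            if (eq > 0) {
                current.put(line.substring(0, eq), line.substring(eq + 1));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = "-------------------------------\n"
                + "COLOR=Black\nDate=1349719200\nPID=23898\nProgram=Java\nEOE\n"
                + "-------------------------------\n"
                + "COLOR=White\nDate=1349719234\nPID=23828\nProgram=Python\nEOE\n";
        List<Map<String, String>> records = parse(sample);
        System.out.println(records.size());                // 2
        System.out.println(records.get(0).get("COLOR"));   // Black
        System.out.println(records.get(1).get("Program")); // Python
    }
}
```

A real Pig loader would do the same splitting line by line inside its getNext() method, emitting tuples instead of maps, but the record-boundary logic is the same.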

So I tried to load the file in my Pig script, but the built-in Pig loader doesn't seem to be able to load my data, so I have to create my own UDF. Since I am pretty new to Pig and Hadoop, I want to try a script written by others before I write my own, just to get a taste of how a UDF works. I found one here: http://pig.apache.org/docs/r0.10.0/udf.html, the SimpleTextLoader. In order to compile this SimpleTextLoader, I had to add a few imports:

import java.io.IOException; 
import java.util.ArrayList;
import org.apache.hadoop.io.Text; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; 
import org.apache.hadoop.mapreduce.InputFormat; 
import org.apache.hadoop.mapreduce.RecordReader; 
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; 
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit; 
import org.apache.pig.backend.executionengine.ExecException; 
import org.apache.pig.data.Tuple; 
import org.apache.pig.data.TupleFactory;
import org.apache.pig.data.DataByteArray; 
import org.apache.pig.PigException; 
import org.apache.pig.LoadFunc;

Then I found out I need to compile this file. I had to install Subversion, check out Pig, and build it:

sudo apt-get install subversion 
svn co http://svn.apache.org/repos/asf/pig/trunk 
ant

Now I have a pig.jar file, so I try to compile and package the UDF:

javac -cp ./trunk/pig.jar SimpleTextLoader.java 
jar -cf SimpleTextLoader.jar SimpleTextLoader.class 

It compiles successfully. I start Pig to enter grunt, and in grunt I try to load the file using:

grunt> register file:/home/hadoop/myudfs.jar
grunt> raw = LOAD 's3://mys3bucket/samplelogs/applog.log' USING myudfs.SimpleTextLoader('=') AS (key:chararray, value:chararray); 

2012-12-05 00:08:26,737 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2998: Unhandled internal error. org/apache/pig/LoadFunc Details at logfile: /home/hadoop/pig_1354666051892.log

Inside pig_1354666051892.log, it has:

Pig Stack Trace
---------------
ERROR 2998: Unhandled internal error. org/apache/pig/LoadFunc

java.lang.NoClassDefFoundError: org/apache/pig/LoadFunc

I also tried to use another UDF (UPPER.java) from http://wiki.apache.org/pig/UDFManual, and I still get the same error when trying to use the UPPER method. Can you please help me out? What's the problem here? Much thanks!

UPDATE: I did try the EMR built-in pig.jar at /home/hadoop/lib/pig/pig.jar, and got the same problem.

Simon Guo
  • Why don't you use the built-in Pig support in EMR? – Guy Dec 05 '12 at 11:02
  • Yeah, I did use the built-in Pig at /home/hadoop/lib/pig/pig.jar but still get the same error. I also specifically registered this pig.jar file in my script, but still get the same error. – Simon Guo Dec 05 '12 at 17:40
  • You can simply provide the Pig script that you developed locally to EMR by putting it in S3. All the configuration and bootstrapping is taken care of by AWS for you. – Guy Dec 05 '12 at 20:39
  • I am running in the interactive mode. – Simon Guo Dec 05 '12 at 21:05
  • EMR is better for running at scale. I suggest you develop your script in local mode on your machine and only deploy the final script to EMR. Anyway, note that you can have several versions of Hadoop in the EMR cluster. Maybe you didn't select the correct version that you wanted. – Guy Dec 05 '12 at 21:24
  • I am running Hadoop 1.0.3 (Amazon distribution). Just wondering, looking at the process by which I compile my UDF Java file, do you find any mistake in it? Inside the pig.jar, it did include an org.apache.pig.LoadFunc.class. The error reporting NoClassDefFoundError doesn't seem to help me find the problem. Or do you have any experience writing UDFs in Pig that you would share with me? I am pretty new to Pig UDFs. – Simon Guo Dec 05 '12 at 21:54
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/20664/discussion-between-simon-guo-and-guy) – Simon Guo Dec 05 '12 at 22:08

2 Answers


Put the UDF jar in the /home/hadoop/lib/pig directory or copy the pig-*-amzn.jar file to /home/hadoop/lib and it will work.

You would probably use a bootstrap action to do either of these.

ubiquitousthey

Most of the Hadoop ecosystem tools, like Pig and Hive, look up $HADOOP_HOME/conf/hadoop-env.sh for environment variables.

I was able to resolve this issue by adding pig-0.13.0-h1.jar (it contains all the classes required by the UDF) to the HADOOP_CLASSPATH:

export HADOOP_CLASSPATH=/home/hadoop/pig-0.13.0/pig-0.13.0-h1.jar:$HADOOP_CLASSPATH

pig-0.13.0-h1.jar is available in the Pig home directory.

Suresh Vadali