I was trying my first UDF in Pig and wrote the following function:

package com.pig.in.action.assignments.udf;

import org.apache.pig.EvalFunc;
import org.apache.pig.PigWarning;
import org.apache.pig.data.Tuple;

import java.io.IOException;


public class CountLength extends EvalFunc<Integer> {

    public Integer exec(Tuple inputVal) throws IOException {

        // Validate Input Value ...
        if (inputVal == null ||
            inputVal.size() == 0 ||
            inputVal.get(0) == null) {

            // Emit warning text for user, and skip this iteration
            super.warn("Inappropriate parameter, Skipping ...",
                       PigWarning.SKIP_UDF_CALL_FOR_NULL);
            return null;
        }

        // Count # of characters in this string ...
        final String inputString = (String) inputVal.get(0);

        return inputString.length();

    }

}

However, when I try to use it as follows, Pig throws an error message that is not easy to understand, at least for me, in the context of my UDF:

grunt> cat dept.txt;
10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

grunt> dept = LOAD '/user/sgn/dept.txt' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
grunt> d = FOREACH dept GENERATE dept_no, com.pig.in.action.assignments.udf.CountLength(d_name);

2015-06-02 16:24:13,416 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 2, column 79>  mismatched input '(' expecting SEMI_COLON
Details at logfile: /home/sgn/pig_1433261973141.log

Can anyone help me figure out what's wrong with this?

I have gone through the documentation, but nothing in the sample above looks obviously wrong to me. Am I missing something here?

These are the libraries I am using in pom.xml:

<dependency>
    <groupId>org.apache.pig</groupId>
    <artifactId>pig</artifactId>
    <version>0.14.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>

Is there a compatibility problem?

Thanks,

-Vipul Pathak;

sgsi

3 Answers

Found the cause of the problem after about 36 hours of downtime.

The package name contains the segment "in", which Pig apparently parses as a keyword rather than as part of the class name.

package com.pig.in.action.assignments.udf;
//              ^^

When I changed the package name to the following, everything worked:

package com.pig.nnn.action.assignments.udf;
//              ^^^

After rebuilding the modified UDF, I registered the jar, defined an alias for the function name, and bingo, everything worked:

REGISTER /user/sgn/UDFs/Pig/CountLength-1.jar;
DEFINE  CL  com.pig.nnn.action.assignments.udf.CountLength;

.   .   .
.   .   .
d = FOREACH dept GENERATE dept_no, CL(d_name) AS DeptLength;

I don't recall whether IN is a reserved word in Pig, but its presence in the package name still causes a problem (at least in Pig 0.14.0).
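To catch this kind of clash before building the jar, a small pre-build check could flag package segments that look like Pig keywords. The following is only a minimal sketch: the class name PigPackageNameCheck, the method suspiciousSegments, and the keyword list are all hypothetical and introduced here for illustration. The list is partial, and apart from "in" (the case described above) it is not verified that each entry actually breaks a fully qualified UDF name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PigPackageNameCheck {

    // ASSUMPTION: a partial, illustrative list of Pig Latin keywords.
    // Only "in" is known (from the case above) to break a qualified UDF name.
    private static final Set<String> ASSUMED_PIG_KEYWORDS = new HashSet<>(Arrays.asList(
            "in", "load", "store", "foreach", "generate", "filter", "group",
            "by", "as", "using", "define", "register", "join", "order",
            "limit", "split", "into"));

    // Returns every dot-separated segment of the name that matches a keyword
    // in the assumed list, compared case-insensitively.
    public static List<String> suspiciousSegments(String qualifiedName) {
        List<String> hits = new ArrayList<>();
        for (String segment : qualifiedName.split("\\.")) {
            if (ASSUMED_PIG_KEYWORDS.contains(segment.toLowerCase(Locale.ROOT))) {
                hits.add(segment);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [in]
        System.out.println(suspiciousSegments("com.pig.in.action.assignments.udf.CountLength"));
        // prints []
        System.out.println(suspiciousSegments("com.pig.nnn.action.assignments.udf.CountLength"));
    }
}
```

A non-empty result is only a hint that the package may need renaming; whether a DEFINE alias alone works around the clash is not something this check can tell you.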

sgsi
  • The IN clause was introduced in Pig 0.12. Refer to: http://hortonworks.com/blog/announcing-apache-pig-0-12/. Surprised by the root cause of this issue :) – Murali Rao Jun 04 '15 at 01:36
  • Yes, Pig has issues with these kinds of keywords; one has to be careful about that. Nice job, man. – Amaresh Jun 04 '15 at 01:38

I tried the above example. As long as the jar is registered with the REGISTER command and is available on the classpath, there should be no error.

REGISTER myudfs.jar;
dept = LOAD 'a.csv' USING PigStorage(',') AS (dept_no: INT, d_name: CHARARRAY, d_loc: CHARARRAY);
d = FOREACH dept GENERATE dept_no, CountLength(d_name) as length;

Input : a.csv

10,ACCOUNTING,NEW YORK
20,RESEARCH,DALLAS
30,SALES,CHICAGO
40,OPERATIONS,BOSTON

Output : d

(10,10)
(20,8)
(30,5)
(40,10)

N.B.: In the above run, the class CountLength was defined in the default package.

If the class CountLength is defined in a package such as com.pig.utility, then to access the UDF we either need a DEFINE statement as below

DEFINE CountLength com.pig.utility.CountLength;

OR

we have to refer to the UDF by its fully qualified name, as below:

d = FOREACH dept GENERATE dept_no, com.pig.utility.CountLength(d_name) as length;
Murali Rao
  • You need to include the package as well when calling it unless you define an alias for it: `DEFINE CountLength com.pig.in.action.assignments.udf.CountLength();` – Balduz Jun 03 '15 at 08:17
  • @Balduz : Agreed that we need to have DEFINE declaration, if we are having CountLength class in a package. In the above test run I have defined the class CountLength in a default package because of which DEFINE statement is not required. – Murali Rao Jun 03 '15 at 08:39
  • Thanks @Balduz and Murali, I have already tried these steps but I have absolutely no idea why it is not working for me. It complains about '(' and says a SEMI_COLON was expected. – sgsi Jun 03 '15 at 23:43
  • Found the problem. Pls. check details below. – sgsi Jun 04 '15 at 01:32

Your jar should be registered and the function aliased, for example:

REGISTER /home/hadoop/udf.jar;

DEFINE CountLength package.CountLength;
Amaresh
  • Thanks Aman, as I commented above, I have performed both of these steps and tried various combinations: putting my jar at different locations, renaming the jar, placing the jar in HDFS, etc. Nothing seems to work. I suspect the reason is something really stupid. The irony, however, is that I don't see what might be the reason my UDF is not being recognized. – sgsi Jun 03 '15 at 23:46
  • Found the problem. Pls. check details below. – sgsi Jun 04 '15 at 01:32