
I'm trying to write a UDF for Hadoop Hive that parses user agents. The following code works fine on my local machine, but on Hadoop I'm getting:

org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String MyUDF .evaluate(java.lang.String) throws org.apache.hadoop.hive.ql.metadata.HiveException on object MyUDF@64ca8bfb of class MyUDF with arguments {All Occupations:java.lang.String} of size 1',

Code:

import org.apache.hadoop.hive.ql.exec.UDF;
import com.decibel.uasparser.OnlineUpdater;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;

public class MyUDF extends UDF {

    public String evaluate(String i) {
        UASparser parser = new UASparser();
        String key = "";
        // updates the UA database from the online service on every call
        OnlineUpdater update = new OnlineUpdater(parser, key);
        UserAgentInfo info = parser.parse(i);
        return info.getDeviceType();
    }
}

A few facts I should mention:

  • I'm compiling in Eclipse with "Export runnable JAR file" and the "Extract required libraries into generated JAR" option

  • I'm uploading this "fat jar" file with Hue

  • A minimal working example that I managed to run:

    public String evaluate(String i) { return "hello" + i; }

  • I guess the problem lies somewhere in the library I'm using (downloaded from https://udger.com), but I have no idea where.

Any suggestions?

Thanks, Michal

  • Did you look into the YARN logs for the `application_xxxx_xxxx` (as reported by Hive) to check for clues, e.g. an inner exception about your JAR being compiled with a version of Java more recent than the JRE used by Hive (just an example)? – Samson Scharfrichter Dec 23 '15 at 16:09

2 Answers


It could be a few things. The best thing is to check the logs, but here is a list of quick things you can check in a minute.

  1. The jar does not contain all dependencies. I am not sure how Eclipse builds a runnable jar, but it may not include all of them. You can run

    jar tf your-udf-jar.jar

to see what was included. You should see stuff from com.decibel.uasparser. If not, you have to build the jar with the appropriate dependencies (usually you do that with Maven).
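
For example, a quick way to confirm that the parser classes made it in (the jar name is a placeholder):

    jar tf your-udf-jar.jar | grep decibel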

  2. A different version of the JVM. If you compile with JDK 8 and the cluster runs JDK 7, it would also fail.

  3. Hive version. Sometimes the Hive APIs change slightly, enough to be incompatible. Probably not the case here, but make sure to compile the UDF against the same versions of Hadoop and Hive that you have on the cluster.
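
You can check what the cluster is actually running with the standard CLI commands:

    hadoop version
    hive --version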

  4. You should always check whether info is null after the call to parse().

  5. It looks like the library uses a key, meaning it actually gets data from an online service (udger.com), so it may not work without a real key. Even more importantly, the library updates itself online, contacting the online service for each record. Looking at the code, this means it will create one update thread per record. You should change the code to do the update only once, when the UDF is instantiated.

Here's how to change it:

public class MyUDF extends UDF {
    // single parser instance shared across all evaluate() calls
    UASparser parser = new UASparser();

    public MyUDF() {
        super();
        String key = "PUT YOUR KEY HERE";
        // update only once, when the UDF is instantiated
        OnlineUpdater update = new OnlineUpdater(parser, key);
    }

    public String evaluate(String i) {
        UserAgentInfo info = parser.parse(i);
        // return null for unparseable records; otherwise one bad
        // record would stop the whole job with an exception
        if (info != null) return info.getDeviceType();
        else return null;
    }
}
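
Once rebuilt, you can sanity-check the UDF from Hive; the function and table names below are made up for illustration:

    ADD JAR /path/to/your-udf-jar.jar;
    CREATE TEMPORARY FUNCTION parse_device AS 'MyUDF';
    SELECT parse_device(user_agent) FROM weblogs LIMIT 10;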

But to know for sure, you have to look at the logs: the YARN logs, but also the Hive logs on the machine you're submitting the job from (probably in /var/log/hive, but it depends on your installation).
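
For the YARN side, you can pull the aggregated task logs with the standard CLI, substituting the application ID that Hive reported:

    yarn logs -applicationId application_xxxx_xxxx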

Roberto Congiu
  • Our Hadoop machine has been down for a while, so I didn't have a chance to check the logs, but... 1) I checked the dependencies, they seem to be OK 2) This was an issue a step back. However, when the version is not compatible, Java throws an exception about the wrong version, not an IOException / HiveException 3) should be OK 4) I'll try this one 5) It works without a key (I checked outside Hadoop). I'm aware of the inefficiency, but solving that should be the next step, I think. – Michal Jan 05 '16 at 23:36
  • However, another idea came to mind while I was going through the library... It tries to write into a temp file. Is that even a legal operation for a UDF (writing to a filesystem)? HDFS is an append-only system, so I smell some trouble here. ...Thanks, I appreciate the help :) – Michal Jan 05 '16 at 23:39
  • It is legal for a UDF to read/write a local file, but it's definitely not recommended! It can be done safely in certain cases, though. At a previous job, we had a configuration file pushed to all the machines, and a UDF that read it and provided its contents to queries. But that library opens a network connection for every record... that's very, very inefficient and bad... so yeah, it smells of trouble :) That library was not designed to work with Hadoop. When writing a UDF that uses a library, you should be very careful and see how it works internally. – Roberto Congiu Jan 06 '16 at 00:06
  • So I found out that the library also supports importing that config from a file. However, I'm still getting a file-not-found exception, so my question is: how do I find the path where the file is uploaded? I uploaded it through Hue, into the same directory as the .jar file. – Michal Feb 08 '16 at 09:18
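
One way to get such a data file onto each task's local filesystem (a sketch, not from the original thread) is Hive's ADD FILE, which ships a resource to every task's working directory, where it can then be opened by its bare name. The file name and path below are placeholders, and the file-based UASparser(String) constructor is an assumption; check the udger library docs for the actual API:

    ADD FILE hdfs:///some_path/uas_data.ini;

and then, in the UDF:

import org.apache.hadoop.hive.ql.exec.UDF;
import com.decibel.uasparser.UASparser;
import com.decibel.uasparser.UserAgentInfo;

public class MyUDF extends UDF {
    // "uas_data.ini" resolves in the task's working directory once the
    // file has been shipped with ADD FILE. UASparser(String) is an
    // assumed file-based loader; check the udger docs for the real API.
    private final UASparser parser = new UASparser("uas_data.ini");

    public String evaluate(String i) {
        UserAgentInfo info = parser.parse(i);
        return info != null ? info.getDeviceType() : null;
    }
}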

Such a problem can probably be solved with the following steps:

  1. Override the method UDF.getRequiredJars(), making it return an HDFS file path list whose values are determined by where you put the xxx_lib folder (from the next step) in HDFS. Note that the list must contain each jar's full HDFS path string, such as hdfs://yourcluster/some_path/xxx_lib/some.jar. A sketch follows after this list.

  2. Export your UDF code with the "Runnable JAR file" export wizard (choose "Copy required libraries into a sub-folder next to the generated JAR"). This step produces xxx.jar and a lib folder xxx_lib next to it.

  3. Put xxx.jar and the xxx_lib folder into HDFS, matching the paths you returned in step 1.

  4. Create the UDF with: add jar ${the-xxx.jar-hdfs-path}; create function your-function as '${qualified name of udf class}';
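
A minimal sketch of step 1; the HDFS path is the placeholder from this answer, and the evaluate() body is reduced to a stub so the class is complete:

import org.apache.hadoop.hive.ql.exec.UDF;

public class MyUDF extends UDF {

    // Hive ships the jars returned here from HDFS to each task;
    // replace the placeholder with the full paths used in step 3
    @Override
    public String[] getRequiredJars() {
        return new String[] {
            "hdfs://yourcluster/some_path/xxx_lib/some.jar"
        };
    }

    // stub, standing in for your real logic
    public String evaluate(String s) {
        return s;
    }
}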

Try it. I tested this and it works.

Rahul Gupta