The problem concerns using Hive UDF jars in PySpark code. We are following the standard set of steps below:
- Create a temporary function in the PySpark code via spark.sql():
spark.sql("create temporary function public_upper_case_udf as 'com.hive.udf.PrivateUpperCase' using JAR 'gs://hivebqjarbucket/UpperCase.jar'")
- Invoke the temporary function in subsequent spark.sql statements.
The issue we are facing: if the Java class in the jar file is not explicitly declared public, the spark.sql invocation of the Hive UDF fails with the following error:
org.apache.spark.sql.AnalysisException: No handler for UDF/UDAF/UDTF 'com.hive.udf.PublicUpperCase'
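A plausible explanation (my assumption, not stated in the error itself): Hive/Spark locates and instantiates the handler class reflectively, and a class declared without the public modifier is package-private, so it is not accessible from outside its own package and the handler lookup fails. A minimal, self-contained sketch of the visibility distinction (class names here are illustrative, not from the actual jar):

```java
import java.lang.reflect.Modifier;

public class Main {
    // Analogous to the question's PrivateUpperCase: no 'public' modifier,
    // so this class is package-private.
    static class PackagePrivateUpperCase {
        public String evaluate(String value) {
            return value.toUpperCase();
        }
    }

    public static void main(String[] args) {
        // Reflective frameworks that check the class's modifiers will see
        // that the public bit is not set on a package-private class.
        Class<?> udfClass = PackagePrivateUpperCase.class;
        System.out.println("public? " + Modifier.isPublic(udfClass.getModifiers()));
    }
}
```

Running this prints `public? false`, which is consistent with the registration failing until the class is declared public.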
Java Class Code
import org.apache.hadoop.hive.ql.exec.UDF;

// Note: no 'public' modifier, so the class is package-private
class PrivateUpperCase extends UDF {
    public String evaluate(String value) {
        return value.toUpperCase();
    }
}
When I make the class public, the issue is resolved.
The question is whether making the class public is the only solution, or whether there is another way around it.
Any assistance is appreciated.
Note - The Hive jars cannot be converted to Spark UDFs owing to their complexity.