3

Maybe it's silly - I am a Microsoft SQL / C# developer and have never really used any other IDE or written Java / Scala before. I am migrating some Azure SQL queries to an Azure Databricks solution.

There seems to be no equivalent of the T-SQL DATEDIFF_BIG function (https://learn.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-2017).

The solutions you find all come down to: program your own UDF.

Which I have done (see below) in a Scala notebook - it works fine as a temporary function (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html).

This was the most helpful sample I found: https://github.com/johnmuller87/spark-udf.

There are quite a number of examples for temporary functions, but I found none for permanent functions aimed at non-Java/Scala developers.

I installed SBT (the latest version for Windows - https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Windows.html) and I also installed IntelliJ.

I ran the SBT build for the IBAN sample, but after uploading the JAR to my cluster and registering the function, I could not get the SQL function to work.

CREATE FUNCTION ValidateIBAN AS 'com.ing.wbaa.spark.udf.ValidateIBAN' USING JAR 'spark_udf_assembly_0_2_0' --without extension

SELECT ValidateIBAN('NL20INGB0001234567')

The error was always "Error in SQL statement: AnalysisException: No handler for UDF/UDAF/UDTF 'com.ing.wbaa.spark.udf.ValidateIBAN'; line 1 pos 7".

//import org.apache.spark.sql.types._                         // include the Spark Types to define our schema
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.functions.udf
import java.time.temporal.ChronoUnit;

// Define function to calculate the difference between two timestamps in the given unit
def getTimestampDifference(interval: java.lang.String, date1: java.sql.Timestamp, date2: java.sql.Timestamp) : java.lang.Long = {

  //https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html
  //https://spark.apache.org/docs/2.4.0/sql-reference.html
  //https://alvinalexander.com/scala/how-to-use-scala-match-expression-like-switch-case-statement

  interval match
  {
    case "NANOSECOND"=> return ChronoUnit.NANOS.between(date1.toInstant(), date2.toInstant());
    case "MICROSECOND"=> return ChronoUnit.MICROS.between(date1.toInstant(), date2.toInstant());
    case "MILLISECOND"=> return ChronoUnit.MILLIS.between(date1.toInstant(), date2.toInstant()); // date2.getTime() - date1.getTime();
    case "SECOND"=> return ChronoUnit.SECONDS.between(date1.toInstant(), date2.toInstant());
    case "MINUTE"=> return ChronoUnit.MINUTES.between(date1.toInstant(), date2.toInstant());
    case "HOUR"=> return ChronoUnit.HOURS.between(date1.toInstant(), date2.toInstant());
    case "DAY"=> return ChronoUnit.DAYS.between(date1.toInstant(), date2.toInstant());
    case "WEEK"=> return ChronoUnit.WEEKS.between(date1.toInstant(), date2.toInstant());
    case "MONTH"=> return ChronoUnit.MONTHS.between(date1.toInstant(), date2.toInstant());
    case "YEAR"=> return ChronoUnit.YEARS.between(date1.toInstant(), date2.toInstant());
  }
}

spark.udf.register("DATETIMEDIFF", udf(getTimestampDifference(_:java.lang.String, _:java.sql.Timestamp,_:java.sql.Timestamp),LongType))
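
As a quick check (my addition, with made-up example timestamps), the temporary registration above can then be called from SQL within the same Spark session:

// Returns 2 for these example values (248 ms - 246 ms)
spark.sql("""
  SELECT DATETIMEDIFF('MILLISECOND',
                      CAST('2019-07-08 16:07:03.246' AS TIMESTAMP),
                      CAST('2019-07-08 16:07:03.248' AS TIMESTAMP)) AS diff_ms
""").show()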

What I actually need is this: how do I turn the Scala notebook into an SQL function so that I can use it in a permanent SQL view on an Azure Databricks cluster, runtime version 5.4 (includes Apache Spark 2.4.3, Scala 2.11)?

  • What class to implement
  • What method to implement (override, in C# terms) - there are also conflicting articles about Hive vs. Spark
  • How to set up the SBT build, or whatever other way to compile it into a Java archive, so that I can successfully create and run the SQL function (in SQL only - not in Python code, nor in Scala code - in an SQL notebook)

Thanks for your help

hmayer1980

2 Answers

2

Spark does not offer any permanent UDF capability lasting for more than a single Spark session - a cluster lifetime in Databricks lingo (see Databricks - Creating permanent User Defined Functions (UDFs)). If you need long-running Spark sessions (the SQL part only), you could consider adding these UDFs to Hive and calling them from Spark. Otherwise (think of transient clusters) you will need to re-register them each time you start the cluster.

The code for your UDF is also not optimal: there is no handling of empty/null values, so it will throw an exception.

For a basic (standard) Spark UDF see https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-udfs.html - no real interfaces are required (unlike Hive).
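
As an illustration, a minimal sketch of such a plain Spark UDF (my own example; it assumes a running SparkSession named spark, as in a Databricks notebook, and the function name TO_UPPER is made up):

// A plain Scala function registered as a Spark UDF - no interface to implement (unlike Hive).
// The registration only lives for the current Spark session.
spark.udf.register("TO_UPPER", (s: String) => if (s == null) null else s.toUpperCase)

// Callable from SQL within the same session:
// spark.sql("SELECT TO_UPPER('nl20ingb0001234567')").show()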

Regarding an SQL function (SQL only) and SBT:

If you really need it (for this simple use case), https://github.com/geoHeil/sparkSimpleProjectTemplate.g8 could serve as an example for you.

But for this code there are no additional dependencies required. It should be enough to create a text / Scala file containing the < 100 lines of code required for your function. This file (notebook?) could then be called on cluster creation using the API, e.g. via https://docs.databricks.com/user-guide/dev-tools/databricks-cli.html and some scripting, and thus behave as if it were permanent.

Furthermore: always consider using Spark-native (Catalyst-optimized) functions. The regular DATEDIFF in Spark SQL might already do a lot of what your DATEDIFF_BIG needs to accomplish, as might simply subtracting plain timestamp-typed columns. If I understand it correctly from a brief glance, only formatting the output to the desired granularity is missing (which the T-SQL function provides out of the box); that could be accomplished by nesting it with different functions like:

  • year
  • day
  • week
  • or manually dividing the returned difference (see the sketch below)
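
For the millisecond case, a rough sketch of that "manual division" idea (my own illustration, using only built-in casts; casting a timestamp to DOUBLE yields epoch seconds including the fractional part):

// Millisecond difference with Catalyst-native casts only - no UDF involved
spark.sql("""
  SELECT CAST((CAST(ts2 AS DOUBLE) - CAST(ts1 AS DOUBLE)) * 1000 AS BIGINT) AS diff_ms
  FROM   VALUES (CAST('2019-07-08 16:07:03.246' AS TIMESTAMP),
                 CAST('2019-07-08 16:07:03.248' AS TIMESTAMP)) AS t(ts1, ts2)
""").show()

If floating-point accuracy is a concern, casting to a fixed-precision decimal instead, e.g. cast(ts1 as Decimal(20,6)) as discussed in the comments below, could avoid the issue.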
Georg Heiler
  • I only need the MILLISECOND part of it, and there seems to be no native function. If there is, then I am happy to use it. I will give https://github.com/geoHeil/sparkSimpleProjectTemplate.g8 a try... – hmayer1980 Jul 10 '19 at 09:04
  • Just cast the timestamps to longs, subtract them, and then manually convert to the desired accuracy level - that could be a way to stick to Catalyst-native functions. – Georg Heiler Jul 10 '19 at 09:40
  • Isn't casting to long just the same as unix_timestamp()? That only gives seconds. It looks like it's possible to cast to double, but I do not trust floating-point operations for accuracy. I do not know the internal datatype structure of timestamp - is it two longs for the date / time parts, or how is it stored? Maybe decimal with a fixed precision down to nanoseconds would be the best solution? `cast(TS1 as Decimal(20,6))` – hmayer1980 Jul 10 '19 at 15:20
2

The CREATE FUNCTION statement in Databricks that you are referencing is actually a Hive command, not Spark, and it expects the UDF class to be a Hive UDF.

That is also the reason for the "No handler for UDF/UDAF/UDTF" error you are getting. The example you have linked implements a Spark UDF, while what you need is to implement a Hive UDF.

To create a Hive UDF, you need to implement a class that extends the class org.apache.hadoop.hive.ql.exec.UDF and implements a function called evaluate. In your case, the whole class should look like this:

package com.ing.wbaa.spark.udf // keep the namespace of the IBAN example, as referenced in CREATE FUNCTION below

import java.time.temporal.ChronoUnit

import org.apache.hadoop.hive.ql.exec.UDF

class GetTimestampDifference extends UDF {

  // https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html
  // https://spark.apache.org/docs/2.4.0/sql-reference.html
  // https://alvinalexander.com/scala/how-to-use-scala-match-expression-like-switch-case-statement
  // Note: an unknown interval name will throw a scala.MatchError
  def evaluate(interval: java.lang.String, date1: java.sql.Timestamp, date2: java.sql.Timestamp): java.lang.Long = {
    interval match {
      case "NANOSECOND"  => ChronoUnit.NANOS.between(date1.toInstant(), date2.toInstant())
      case "MICROSECOND" => ChronoUnit.MICROS.between(date1.toInstant(), date2.toInstant())
      case "MILLISECOND" => ChronoUnit.MILLIS.between(date1.toInstant(), date2.toInstant()) // date2.getTime() - date1.getTime()
      case "SECOND"      => ChronoUnit.SECONDS.between(date1.toInstant(), date2.toInstant())
      case "MINUTE"      => ChronoUnit.MINUTES.between(date1.toInstant(), date2.toInstant())
      case "HOUR"        => ChronoUnit.HOURS.between(date1.toInstant(), date2.toInstant())
      case "DAY"         => ChronoUnit.DAYS.between(date1.toInstant(), date2.toInstant())
      case "WEEK"        => ChronoUnit.WEEKS.between(date1.toInstant(), date2.toInstant())
      case "MONTH"       => ChronoUnit.MONTHS.between(date1.toInstant(), date2.toInstant())
      case "YEAR"        => ChronoUnit.YEARS.between(date1.toInstant(), date2.toInstant())
    }
  }
}

You then need to compile it into a JAR file, copy it somewhere into the Databricks file system (DBFS), and create the permanent function using the same command as before (assuming you keep the namespace of the IBAN example):

CREATE FUNCTION GetTimestampDifference AS 'com.ing.wbaa.spark.udf.GetTimestampDifference' USING JAR '[path to your jar in dbfs]'

SELECT GetTimestampDifference ("MILLISECOND",cast("2019-07-08 16:07:03.246" as timestamp), cast("2019-07-08 16:07:03.248" as timestamp))

Assuming you are still modifying the IBAN example project that you started with, in order to create the jar file you will have to add the following package dependency to the build.sbt file:

"org.apache.spark" %% "spark-hive" % "2.4.3"