Maybe it's silly, but I am a Microsoft SQL / C# developer and have never really used any other IDE or written Java / Scala before. I am migrating some Azure SQL queries to an Azure Databricks solution.
There seems to be no equivalent of the T-SQL DATEDIFF_BIG function (https://learn.microsoft.com/en-us/sql/t-sql/functions/datediff-transact-sql?view=sql-server-2017).
The solutions you find all say: write your own UDF.
Which I have done (see below) in a Scala notebook, and it works fine as a temporary function (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html).
The most helpful sample I found was https://github.com/johnmuller87/spark-udf.
There are quite a number of examples for temporary functions, but I found none for permanent functions aimed at non-Java/Scala developers.
I installed SBT (the latest version for Windows - https://www.scala-sbt.org/1.x/docs/Installing-sbt-on-Windows.html) and I also installed IntelliJ.
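For reference, my understanding is that a minimal build.sbt for such a UDF JAR would look roughly like the sketch below; the Spark/Hive library versions are my own assumptions for matching Databricks Runtime 5.4 (Spark 2.4.3, Scala 2.11), and the project name is a placeholder:

// build.sbt - minimal sketch; versions assumed to match the Databricks runtime
name := "datetimediff-udf"   // placeholder project name
version := "0.1.0"
scalaVersion := "2.11.12"

// Spark (and Hive) classes already exist on the cluster, so mark them "provided"
// and keep them out of the packaged JAR
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-hive" % "2.4.3" % "provided"
)

With only provided dependencies, sbt package should produce the JAR to upload to the cluster (the IBAN sample uses the sbt-assembly plugin instead, which is presumably why its JAR name contains "assembly").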
I ran the SBT build for the IBAN sample, but after uploading the JAR to my cluster and registering the function, I could not get the SQL function to work:
CREATE FUNCTION ValidateIBAN AS 'com.ing.wbaa.spark.udf.ValidateIBAN' USING JAR 'spark_udf_assembly_0_2_0' --without extension
SELECT ValidateIBAN('NL20INGB0001234567')
The error was always "Error in SQL statement: AnalysisException: No handler for UDF/UDAF/UDTF 'com.ing.wbaa.spark.udf.ValidateIBAN'; line 1 pos 7".
For reference, here is the temporary function from my Scala notebook:
import org.apache.spark.sql.types.LongType
import org.apache.spark.sql.functions.udf
import java.time.temporal.ChronoUnit

// Calculate the difference between two timestamps in the given unit (the DATEDIFF_BIG replacement)
// https://docs.oracle.com/javase/8/docs/api/java/sql/Timestamp.html
// https://spark.apache.org/docs/2.4.0/sql-reference.html
// https://alvinalexander.com/scala/how-to-use-scala-match-expression-like-switch-case-statement
def getTimestampDifference(interval: String, date1: java.sql.Timestamp, date2: java.sql.Timestamp): java.lang.Long = {
  interval match {
    // time-based units work directly on Instant
    case "NANOSECOND"  => ChronoUnit.NANOS.between(date1.toInstant(), date2.toInstant())
    case "MICROSECOND" => ChronoUnit.MICROS.between(date1.toInstant(), date2.toInstant())
    case "MILLISECOND" => ChronoUnit.MILLIS.between(date1.toInstant(), date2.toInstant()) // i.e. date2.getTime() - date1.getTime()
    case "SECOND"      => ChronoUnit.SECONDS.between(date1.toInstant(), date2.toInstant())
    case "MINUTE"      => ChronoUnit.MINUTES.between(date1.toInstant(), date2.toInstant())
    case "HOUR"        => ChronoUnit.HOURS.between(date1.toInstant(), date2.toInstant())
    case "DAY"         => ChronoUnit.DAYS.between(date1.toInstant(), date2.toInstant())
    // date-based units are not supported on Instant, so convert to LocalDateTime first
    case "WEEK"        => ChronoUnit.WEEKS.between(date1.toLocalDateTime(), date2.toLocalDateTime())
    case "MONTH"       => ChronoUnit.MONTHS.between(date1.toLocalDateTime(), date2.toLocalDateTime())
    case "YEAR"        => ChronoUnit.YEARS.between(date1.toLocalDateTime(), date2.toLocalDateTime())
    case _             => throw new IllegalArgumentException(s"Unsupported interval: $interval")
  }
}

spark.udf.register("DATETIMEDIFF", udf(getTimestampDifference(_: String, _: java.sql.Timestamp, _: java.sql.Timestamp), LongType))
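After running the cell above, the temporary function can be called from SQL in the same notebook session, for example (the literals are just an illustration):

// call the registered temporary UDF from SQL
spark.sql("SELECT DATETIMEDIFF('SECOND', to_timestamp('2019-01-01 00:00:00'), to_timestamp('2019-01-01 01:00:00'))").show()
// should return 3600

But as far as I understand, this registration only lives for the current session/notebook, which is why I need a permanent function for my views.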
What I actually need is: how do I transform the Scala notebook code into a SQL function, so that I can use it in a permanent SQL view on an Azure Databricks cluster running Runtime 5.4 (includes Apache Spark 2.4.3, Scala 2.11)? Specifically:
- Which class to implement (see my untested guess in the sketch after this list)
- Which method to implement (what would be an override in C#) - the articles I found differ on whether this should be a Hive or a Spark UDF
- How to set up the SBT build (or whatever other way to compile it into a Java archive) so that I can successfully create and run the SQL function - in SQL only, not in Python or Scala code, but in a SQL notebook
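For what it's worth, my current (unverified) guess is that CREATE FUNCTION ... USING JAR expects a Hive-style UDF class, i.e. a class extending org.apache.hadoop.hive.ql.exec.UDF with an evaluate method, roughly like the sketch below; the package and class names are made up, and I have not managed to confirm that this is the right contract:

package com.example.udf  // hypothetical package name

import java.sql.Timestamp
import java.time.temporal.ChronoUnit
import org.apache.hadoop.hive.ql.exec.UDF

// Hive's "simple" UDF API: the class named in CREATE FUNCTION is instantiated
// and its evaluate method is located and called via reflection
class DateTimeDiff extends UDF {
  def evaluate(interval: String, date1: Timestamp, date2: Timestamp): java.lang.Long =
    interval match {
      case "SECOND" => ChronoUnit.SECONDS.between(date1.toInstant(), date2.toInstant())
      case "MINUTE" => ChronoUnit.MINUTES.between(date1.toInstant(), date2.toInstant())
      case "HOUR"   => ChronoUnit.HOURS.between(date1.toInstant(), date2.toInstant())
      case "DAY"    => ChronoUnit.DAYS.between(date1.toInstant(), date2.toInstant())
      case _        => throw new IllegalArgumentException(s"Unsupported interval: $interval")
    }
}

If that guess is right, the registration would presumably be CREATE FUNCTION DATETIMEDIFF AS 'com.example.udf.DateTimeDiff' USING JAR ... - but this is exactly the part I cannot confirm, hence the question.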
Thanks for your help