I am writing queries in Databricks using SQL on views and would like to calculate the max of update timestamp columns across multiple views. For instance, I am joining table a with table b and would like to know max(a.updt_ts, b.updt_ts). Since the MAX function cannot take more than one column, I want to create a function. Any help is greatly appreciated. Below is what I have:

CREATE TEMPORARY FUNCTION ufnGetMaxDt (@Date1 DATETIME2, @Date2 DATETIME2)
RETURNS DATETIME2
AS
BEGIN
    DECLARE @ret   DATETIME2
          , @MinDt DATETIME2;

    -- Fallback value for null inputs
    SET @MinDt = CAST('1900-01-01' AS DATETIME2);

    IF @Date1 IS NULL SET @Date1 = @MinDt;
    IF @Date2 IS NULL SET @Date2 = @MinDt;

    -- Return the later of the two dates
    SET @ret = CASE WHEN @Date1 >= @Date2
                    THEN @Date1
                    ELSE @Date2
                    END;

    IF @ret IS NULL
        SET @ret = @MinDt;  -- Dummy date
    RETURN @ret;
END
GO

1 Answer

You could just use GREATEST, e.g.:

SELECT *, GREATEST( date1, date2 ) xmax 
FROM tmp
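
One thing to watch: GREATEST in Spark SQL skips null inputs, but it still returns null when every argument is null. If you want the same '1900-01-01' fallback your function used, something like this would do it (tmp, date1 and date2 are just the placeholder names from above):

-- COALESCE supplies the dummy date only when both columns are null;
-- GREATEST already ignores a single null on its own
SELECT *,
       COALESCE( GREATEST( date1, date2 ), CAST( '1900-01-01' AS TIMESTAMP ) ) xmax
FROM tmp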

Or put them in an array, explode it and then take the MAX of that, e.g. something like this:

%sql
WITH cte AS
(
  SELECT *, EXPLODE( ARRAY( date1, date2 ) ) AS xmax
  FROM tmp
)
SELECT MAX( xmax )
FROM cte
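
Note that MAX there is an aggregate, so that query collapses everything to a single row: the latest timestamp across both columns and all rows. If you wanted one result per row instead, you'd group by a key, e.g. something like this (id is just a placeholder key column):

%sql
WITH cte AS
(
  -- one output row per input row per date column
  SELECT *, EXPLODE( ARRAY( date1, date2 ) ) AS xmax
  FROM tmp
)
-- collapse back to one row per key, keeping the latest date
SELECT id, MAX( xmax ) AS max_dt
FROM cte
GROUP BY id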

Seems a bit excessive when you can just use GREATEST, though. It's also worth having a read through the list of Spark SQL built-in functions; you don't have to remember them all, but it's useful to at least know what's possible:

https://spark.apache.org/docs/2.3.0/api/sql/index.html
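
If you're curious why the built-in is preferable, running EXPLAIN shows GREATEST inlined straight into the query plan, where Spark's optimizer can reason about it; a user-defined function would show up as an opaque call instead (tmp is the same placeholder view):

%sql
-- the plan prints greatest(date1, date2) as a plain expression, not a UDF call
EXPLAIN SELECT GREATEST( date1, date2 ) AS xmax FROM tmp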

How would I do that? Sorry, noob question. – codewalker Jul 06 '18 at 21:47
Not only is a built-in function easier to maintain, but it also helps Spark's optimizer choose the best physical plan. The reason is that Spark sees a user-defined function as a black box, and as such may miss out on some potential optimizations. – Dillon Jul 09 '18 at 12:06