The script below (Spark 1.6) aborts with a java.lang.NullPointerException, apparently caused by the LAG function. Please advise.
from pyspark.sql import HiveContext
sqlc = HiveContext(sc)
rdd = sc.parallelize([(1, 65), (2, 66), (3, 65), (4, 68), (5, 71)])
df = sqlc.createDataFrame(rdd, ["account_nbr", "date_time"])
df.registerTempTable("test1")
df2 = sqlc.sql("select a.*, case when lag(a.date_time) is NULL then 0 else lag(a.date_time) end as prev_date_time from test1 a")
df2.toPandas()
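The lag calls in the query have no OVER clause, and Spark SQL window functions require one, which is the likely cause of the failure here. A minimal sketch of a corrected query (ordering the window by account_nbr is an assumption), using coalesce to substitute 0 for the null produced on the first row:
df2 = sqlc.sql("""
    select a.*,
           coalesce(lag(a.date_time) over (order by a.account_nbr), 0) as prev_date_time
    from test1 a
""")
df2.toPandas()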
An alternative is to use the when and isnull functions from pyspark.sql.functions and default the lagged difference to 0 when it is null.
df = df.withColumn("prv_date_time", F.lag(df.date_time).over(my_window))
df = df.withColumn("prv_account_nbr", F.lag(df.account_nbr).over(my_window))
df = df.withColumn("diff_sec", F.when(F.isnull(df.date_time - df.prv_date_time), 0)
.otherwise(df.date_time - df.prv_date_time))
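With the sample data above and this window, prv_date_time is null on the first row, so diff_sec comes out as 0, 1, -1, 3, 3. An equivalent, more compact form (a sketch, not part of the original answer) uses F.coalesce to fold the null handling into one expression:
df = df.withColumn("diff_sec",
                   F.coalesce(df.date_time - F.lag(df.date_time).over(my_window), F.lit(0)))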