0

The script below (Spark 1.6) aborts with java.lang.NullPointerException, primarily due to the function LAG. Please advise.

from pyspark.sql import HiveContext
sqlc= HiveContext(sc) 

rdd = sc.parallelize([(1, 65), (2, 66), (3, 65), (4, 68), (5, 71)])

df = sqlc.createDataFrame(rdd, ["account_nbr", "date_time"])
df.registerTempTable("test1")

df2 = sqlc.sql("select a.*, case when lag(a.date_time) is NULL then 0 else  lag(a.date_time) end  as prev_date_time from test1 a")
df2.toPandas()

Alternative is to use functions when and isnull under pyspark.sql.functions and floor the lag to 0 if isnull.

df = df.withColumn("prv_date_time", F.lag(df.date_time).over(my_window))
df = df.withColumn("prv_account_nbr", F.lag(df.account_nbr).over(my_window))
df = df.withColumn("diff_sec", F.when(F.isnull(df.date_time -     df.prv_date_time), 0)
                          .otherwise(df.date_time - df.prv_date_time))
  • Can you please post the stacktrace? – Justin Pihony Jun 09 '17 at 17:57
  • >>> df2.toPandas() 17/06/09 14:00:10 ERROR Executor: Exception in task 3.0 in stage 27.0 (TID 193) java.lang.NullPointerException at org.apache.hadoop.hive.ql.udf.generic.GenericUDFLeadLag.evaluate(GenericUDFLeadLag.java:57) at org.apache.spark.sql.hive.HiveGenericUDF.eval(hiveUDFs.scala:255) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalExpr2$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.Proj – user12345dr Jun 09 '17 at 18:03
  • Don't post it as a comment. Edit your question. – M-- Jun 09 '17 at 21:21

1 Answers1

0

Lag is an analytic function, it gives you access to more than 1 row at the same time, so you need to order over some value. For example, this works:

rdd = sc.parallelize([(1, 65), (2, 66), (3, 65), (4, 68), (5, 71)])

df = sqlc.createDataFrame(rdd, ["account_nbr", "date_time"])
df.registerTempTable("test1")

df2 = sqlc.sql("select a.*, nvl(lag(a.date_time) over (order by a.date_time),0) as lag_date_time from test1 a")

Also note the use of NVL to handle the nulls (if you want them replaced with 0's in this case)

Output:

+-----------+---------+-------------+
|account_nbr|date_time|lag_date_time|
+-----------+---------+-------------+
|          1|       65|            0|
|          3|       65|           65|
|          2|       66|           65|
|          4|       68|           66|
|          5|       71|           68|
+-----------+---------+-------------+
tbone
  • 15,107
  • 3
  • 33
  • 40