I would like to calculate the difference between two values within the same column. Right now I just want the difference between the last value and the first value, but using last(column) returns a null result. Is there a reason last() would not return a value? Also, is there a way to pass the positions of the values I want as variables, e.g. the 10th and the 1st, or the 7th and the 6th?

Current code, using Spark 1.4.0 and Scala 2.11.6:

myDF = some dataframe with n rows by m columns

def difference(col: Column): Column = {
  last(col) - first(col)
}

def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1"),
    difference($"Column2"),
    difference($"Column3"),
    difference($"Column4")
  )
}

When I run diffCalcs(myDF), it returns a null result. If I modify difference to use only first(col), it returns the first value for each of the four columns. However, if I change it to last(col), it returns null. If I call myDF.show(), I can see that all of the columns have Double values on every row; there are no null values in any of the columns.

the3rdNotch

1 Answer

After updating to Spark 1.5.0, the code snippet from the question worked. The upgrade was what ultimately fixed it. For completeness, here is the code I used after updating the Spark version:

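import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{first, last}

// Last value minus first value of a column, as an aggregate expression.
def difference(col: Column): Column = {
  last(col) - first(col)
}

// Assumes a HiveContext named hiveContext is already in scope.
def diffCalcs(dataFrame: DataFrame): DataFrame = {
  import hiveContext.implicits._
  dataFrame.agg(
    difference($"Column1").alias("newColumn1"),
    difference($"Column2").alias("newColumn2"),
    difference($"Column3").alias("newColumn3"),
    difference($"Column4").alias("newColumn4")
  )
}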
the3rdNotch
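
On the second part of the question (picking arbitrary positions such as the 10th and the 1st value), one possible approach is to index the rows with zipWithIndex on the underlying RDD and look up the two positions. The sketch below is not from the original post: positionalDiff is a hypothetical helper name, positions are 0-based, and it assumes that the DataFrame's current row order is the order you care about and that the column holds non-null Double values.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not part of the original post):
// returns value(i) - value(j) for two 0-based row positions in one column.
// Assumption: the DataFrame's current row order is meaningful, and the
// column contains non-null Doubles.
def positionalDiff(df: DataFrame, column: String, i: Long, j: Long): Double = {
  val byIndex = df.select(column).rdd
    .zipWithIndex()                                   // pair each Row with its position
    .filter { case (_, idx) => idx == i || idx == j } // keep only the two rows needed
    .map { case (row, idx) => (idx, row.getDouble(0)) }
    .collectAsMap()                                   // at most two entries reach the driver
  byIndex(i) - byIndex(j)
}
```

For example, positionalDiff(myDF, "Column1", 9L, 0L) would give the 10th value minus the 1st. Note that row order in a DataFrame is generally only reliable right after loading or after an explicit sort, so it may be worth sorting on a known key before relying on positions (the same caveat applies to first and last).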