
The following Scala (Spark 1.6) code for reading a value from a Row fails with a NullPointerException when the value is null.

val test = row.getAs[Int]("ColumnName").toString

while this works fine

val test1 = row.getAs[Int]("ColumnName") // returns 0 for null
val test2 = test1.toString // converts to String fine

What is causing NullPointerException and what is the recommended way to handle such cases?

PS: I am getting the rows from the DataFrame as follows:

val myRDD = myDF.repartition(partitions)
  .mapPartitions{ rows => 
    rows.flatMap{ row =>
      functionWithRows(row) //has above logic to read null column which fails
    }
  }

functionWithRows then throws the above-mentioned NullPointerException.
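
For illustration, a simplified sketch of functionWithRows (the real implementation is not shown; only the failing read matters here, and the return type is an assumption so that flatMap compiles):

import org.apache.spark.sql.Row

// Simplified stand-in for functionWithRows: reads the nullable column and
// throws a NullPointerException when ColumnName is null.
def functionWithRows(row: Row): Iterator[String] = {
  val test = row.getAs[Int]("ColumnName").toString // NPE here when the value is null
  Iterator(test)
}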

myDF schema:

root
 |-- LDID: string (nullable = true)
 |-- KTAG: string (nullable = true)
 |-- ColumnName: integer (nullable = true)
  • Can you edit your question and add the entire stacktrace? I can't seem to reproduce the issue with the recent version of Spark 2.3.0-SNAPSHOT. – Jacek Laskowski Dec 19 '17 at 11:36
  • @JacekLaskowski: I have abstracted out my production code. Using spark 1.6.1 it gives NullPointerException for the line: val test = row.getAs[Int]("ColumnName").toString – Anurag Sharma Dec 19 '17 at 11:53
  • 1
    @JacekLaskowski This does throw NPE `spark.sql(" select 1 as col union all select null as col").map(_.getAs[Int]("col").toString ).collect`. Removing `toString` works. – philantrovert Dec 19 '17 at 12:11

2 Answers


getAs is defined as:

def getAs[T](i: Int): T = get(i).asInstanceOf[T]

and when we call toString we invoke Object.toString, which doesn't depend on the type, so the asInstanceOf[T] gets dropped by the compiler, i.e.

row.getAs[Int](0).toString -> row.get(0).toString

We can confirm that by writing a simple Scala snippet:

import org.apache.spark.sql._

object Test {
  val row = Row(null)
  row.getAs[Int](0).toString
}

and then compiling it:

$ scalac -classpath $SPARK_HOME/jars/'*' -print test.scala
[[syntax trees at end of                   cleanup]] // test.scala
package <empty> {
  object Test extends Object {
    private[this] val row: org.apache.spark.sql.Row = _;
    <stable> <accessor> def row(): org.apache.spark.sql.Row = Test.this.row;
    def <init>(): Test.type = {
      Test.super.<init>();
      Test.this.row = org.apache.spark.sql.Row.apply(scala.this.Predef.genericWrapArray(Array[Object]{null}));
      Test.this.row().getAs(0).toString();
      ()
    }
  }
}
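
The same elision can be reproduced without Spark. A minimal sketch using a hypothetical generic helper (getNull is not part of any API, purely illustrative):

object ErasureDemo {
  // Generic helper mirroring getAs: after erasure it simply returns null.
  def getNull[T]: T = null.asInstanceOf[T]

  def main(args: Array[String]): Unit = {
    val asInt: Int = getNull[Int] // null is unboxed to the zero value, so this is 0
    println(asInt)

    getNull[Int].toString         // toString is called on the null reference: NPE
  }
}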

So the proper way would be:

String.valueOf(row.getAs[Int](0))
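
For example, a quick sketch of the difference, reusing the single-column Row from the Test object above:

import org.apache.spark.sql.Row

val row = Row(null)
val safe = String.valueOf(row.getAs[Int](0)) // null is unboxed to the zero value, so this should yield "0"
// row.getAs[Int](0).toString                // throws NullPointerException as shown above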

To avoid null values, it is better practice to check with isNullAt before reading, as the documentation suggests:

getAs

<T> T getAs(int i)

Returns the value at position i. For primitive types if value is null it returns 'zero value' specific for primitive ie. 0 for Int - use isNullAt to ensure that value is not null

I agree the behaviour is confusing, though.
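
A minimal sketch of that approach, assuming the question's nullable ColumnName field (readColumnName and the fallback value are hypothetical):

import org.apache.spark.sql.Row

def readColumnName(row: Row): String = {
  val idx = row.fieldIndex("ColumnName") // position of the column in the schema
  if (row.isNullAt(idx)) "null"          // handle the null case explicitly
  else row.getAs[Int](idx).toString      // safe: value is known to be non-null
}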
