I want to read a Hive table in Spark using Scala, extract some or all of its fields, and then save the data to HDFS.
My code is as follows:
import scala.collection.mutable.ArrayBuffer

val data = spark.sql("select * from table1 limit 1000")
val new_rdd = data.rdd.map(row => {
  val arr = new ArrayBuffer[String]()
  val len = row.size
  // Collect every field of the row as a String
  for (i <- 0 until len) arr += row.getAs[String](i)
  arr.toArray
})
new_rdd.take(10).foreach(println)
new_rdd.map(_.mkString("\t")).saveAsTextFile(dataOutputPath)
The above chunk is the one that finally worked.
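(As an aside, I believe the loop could also be written more compactly like this; just a sketch, I have not run it against the real table:)
// Equivalent, more compact form of the loop above (untested sketch)
val new_rdd2 = data.rdd.map(row =>
  (0 until row.size).map(i => row.getAs[String](i)).toArray
)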
I had written another version, where this line:
for (i <- 0 until len) arr += row.getAs[String](i)
was replaced by this line:
for (i <- 0 until len) arr += row.get(i).toString
To me, both lines do exactly the same thing: for each row, get the i-th element as a String and append it to the ArrayBuffer, which is converted to an Array at the end.
However, the two versions behave differently.
The first line works well: the data were saved to HDFS correctly.
With the second line, however, this error was thrown when saving the data:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 56 in stage 3.0 failed 4 times, most recent failure: Lost task 56.3 in stage 3.0 (TID 98, ip-172-31-18-87.ec2.internal, executor 6): java.lang.NullPointerException
Therefore, I wonder: is there some intrinsic difference between getAs[String](i) and get(i).toString?
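In case it helps anyone reproduce this, here is a minimal standalone sketch of the two variants on a tiny DataFrame that contains a NULL field (the schema and data are made up, since I cannot share table1):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()

// Hypothetical two-column table; the second field of the second row is NULL.
val schema = StructType(Seq(
  StructField("a", StringType, nullable = true),
  StructField("b", StringType, nullable = true)
))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("x", "y"), Row("z", null))),
  schema
)

// Variant 1: getAs[String](i)
df.rdd.map(row => (0 until row.size).map(i => row.getAs[String](i)).mkString("\t")).collect()

// Variant 2: get(i).toString
df.rdd.map(row => (0 until row.size).map(i => row.get(i).toString).mkString("\t")).collect()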
Many thanks