I want to read a Hive table in Spark using Scala, extract some or all of its fields, and then save the data to HDFS.
My code is as follows:
import scala.collection.mutable.ArrayBuffer

val data = spark.sql("select * from table1 limit 1000")
val new_rdd = data.rdd.map(row => {
  val arr = new ArrayBuffer[String]()
  val len = row.size
  // Collect every field of the row as a String
  for (i <- 0 until len) arr += row.getAs[String](i)
  arr.toArray
})
new_rdd.take(10).foreach(println)
new_rdd.map(_.mkString("\t")).saveAsTextFile(dataOutputPath)
The above chunk is the one that finally worked.
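(As an aside, I believe the loop could also be written more compactly like this; just a sketch, I have not run it against the real table:)
// Equivalent, more compact form of the loop above (untested sketch)
val new_rdd2 = data.rdd.map(row =>
  (0 until row.size).map(i => row.getAs[String](i)).toArray
)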
I had written another version, where this line:
for (i <- 0 until len) arr += row.getAs[String](i)
was replaced by this line:
for (i <- 0 until len) arr += row.get(i).toString
To me, both lines do exactly the same thing: for each row, get the i-th element as a String and append it to the ArrayBuffer, which is converted to an Array at the end.
However, the two versions behave differently.
The first line works well: the data were saved to HDFS correctly.
With the second line, however, this error was thrown when saving the data:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 56 in stage 3.0 failed 4 times, most recent failure: Lost task 56.3 in stage 3.0 (TID 98, ip-172-31-18-87.ec2.internal, executor 6): java.lang.NullPointerException
Therefore, I wonder: is there some intrinsic difference between getAs[String](i) and get(i).toString?
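In case it helps anyone reproduce this, here is a minimal standalone sketch of the two variants on a tiny DataFrame that contains a NULL field (the schema and data are made up, since I cannot share table1):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()

// Hypothetical two-column table; the second field of the second row is NULL.
val schema = StructType(Seq(
  StructField("a", StringType, nullable = true),
  StructField("b", StringType, nullable = true)
))
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("x", "y"), Row("z", null))),
  schema
)

// Variant 1: getAs[String](i)
df.rdd.map(row => (0 until row.size).map(i => row.getAs[String](i)).mkString("\t")).collect()

// Variant 2: get(i).toString
df.rdd.map(row => (0 until row.size).map(i => row.get(i).toString).mkString("\t")).collect()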
Many thanks