I am currently working on migrating a Hive table. The problem I am facing is that when I load it into a DataFrame in Spark and write it back to Hive, I get a serialization error like the one in the attached screenshot. However, when I perform the same DataFrame transformation with only ~60 columns, it works fine. We need approximately 200 of the 1800 columns in the table, and I suspect the table's width is what triggers the error.
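For context, the column pruning I am doing is roughly of this shape (the column names below are placeholders, not the real schema):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Placeholder names; the real job keeps ~200 of the ~1800 columns.
def selectNeeded(df: DataFrame): DataFrame = {
  val neededCols = Seq("col_a", "col_b", "col_c")
  df.select(neededCols.map(col): _*)
}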
I think the serialization error is caused by Spark version 2.2.0 (the version running on our dev clusters). There is a bug related to org.apache.spark.unsafe.types.UTF8String$IntWrapper that causes this error in that version; according to the issue report it was resolved in Spark 2.2.1 and later. As a workaround I tried bundling the updated dependencies into the jar I deploy, but I still hit the same issue, I think because the job is still picking up the default jars from the dev environment. I would be very thankful for any alternative way to get this running.
Link to the Apache bug: https://issues.apache.org/jira/browse/SPARK-21445
Link to the fix: https://github.com/apache/spark/pull/18660/commits/d2202903518b3dfa0f4a719a0b9cb5431088ed66
In that fix the code is written in Java, while I am writing mine in Scala. I want to know how to import the IntWrapper and LongWrapper classes from UTF8String and have the program use my declared variables. For this I have written the classes below, taking the class in the linked commit as a reference, and I use a KryoRegistrator in Spark to register them with my SparkSession. Is this correct?

bsTst3.scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.sources.IsNotNull
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.SerializableWritable
import java.io._
// Local re-declarations modeled on the IntWrapper/LongWrapper inner classes from the linked fix
@SerialVersionUID(0L)
class IntWrapper extends Serializable {
  @transient var value: Int = 0
}

@SerialVersionUID(1L)
class LongWrapper extends Serializable {
  @transient var value: Long = 0
}
object bsTst3 {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DrfApp")
      .config("spark.kryo.registrator", "bsKryoRegistrator")
      .config("spark.kryoserializer.buffer", "1024k")
      .config("spark.kryoserializer.buffer.max", "1024m")
      .enableHiveSupport().getOrCreate()
    val bDRF = spark.sql("select * from global.table_partition1 limit 10")
    import spark.implicits._
    bDRF.write.saveAsTable("katukuri.sample3")
  }
}
And this is the Scala class that I call from the main object:

bsKryoRegistrator.scala
import org.apache.spark.serializer.KryoRegistrator
import com.esotericsoftware.kryo.Kryo
import org.apache.hadoop.io.NullWritable

class bsKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Byte])
    kryo.register(classOf[Short])
    //kryo.register(classOf[Int])
    //kryo.register(classOf[Long])
    kryo.register(classOf[IntWrapper])
    kryo.register(classOf[LongWrapper])
    kryo.register(classOf[Float])
    kryo.register(classOf[Double])
    kryo.register(classOf[String])
    kryo.register(classOf[Boolean])
    kryo.register(classOf[Char])
    // Null, Nothing and None are not class types, so classOf does not compile for them
    //kryo.register(classOf[Null])
    //kryo.register(classOf[Nothing])
    //kryo.register(classOf[None])
    kryo.register(classOf[NullWritable])
  }
}
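If the intent is to reference Spark's own inner classes directly rather than my re-declared IntWrapper/LongWrapper, I assume the imports and registrations would look roughly like this (untested sketch; the class name bsKryoRegistrator2 is just a placeholder):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator
// IntWrapper and LongWrapper are static inner classes of UTF8String
import org.apache.spark.unsafe.types.UTF8String

// Untested sketch: register the exact classes named in the serialization stack
// instead of my locally declared wrappers.
class bsKryoRegistrator2 extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[UTF8String.IntWrapper])
    kryo.register(classOf[UTF8String.LongWrapper])
  }
}

If that is the right direction, I would also point spark.kryo.registrator at this class instead.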
The command I use to run it on my dev cluster:
spark2-submit --class bsTst3 --master yarn --queue root.default --deploy-mode client BSDrdTest3-0.0.1-SNAPSHOT.jar
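If the underlying problem is that the cluster's default Spark jars shadow the updated dependencies I bundled into my jar, I assume the command could be extended with the (experimental) classpath-precedence settings below, though I have not verified that this helps in client mode:

spark2-submit --class bsTst3 --master yarn --queue root.default --deploy-mode client \
  --conf spark.executor.userClassPathFirst=true \
  --conf spark.driver.userClassPathFirst=true \
  BSDrdTest3-0.0.1-SNAPSHOT.jar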
The error I get, which is similar to the one in the bug report:
19/02/27 18:06:14 INFO spark.SparkContext: Starting job: saveAsTable at bsTst3.scala:37
19/02/27 18:06:14 INFO scheduler.DAGScheduler: Registering RDD 6 (saveAsTable at bsTst3.scala:37)
19/02/27 18:06:14 INFO scheduler.DAGScheduler: Got job 0 (saveAsTable at bsTst3.scala:37) with 1 output partitions
19/02/27 18:06:14 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (saveAsTable at bsTst3.scala:37)
19/02/27 18:06:14 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
19/02/27 18:06:14 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
19/02/27 18:06:14 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[6] at saveAsTable at bsTst3.scala:37), which has no missing parents
19/02/27 18:06:14 INFO cluster.YarnScheduler: Cancelling stage 0
19/02/27 18:06:14 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (saveAsTable at bsTst3.scala:37) failed in Unknown s due to Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
Serialization stack:
- object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper@63ec9c06)
- field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)