I am trying to use Spark with the Kryo serializer to store some data at a lower memory cost. Now I have run into a problem: I cannot cache a DataFrame (whose type is Dataset[Row]) in memory with the Kryo serializer. I thought all I needed to do was add org.apache.spark.sql.Row to spark.kryo.classesToRegister, but the error still occurs:

spark-shell \
  --conf spark.kryo.classesToRegister=org.apache.spark.sql.Row \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrationRequired=true
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val seq = Seq(("hello", 1), ("world", 2))
val rdd = sc.parallelize(seq).map { case (name, id) => Row(name, id) }
val df = spark.createDataFrame(rdd, schema).persist(StorageLevel.MEMORY_ONLY_SER)
df.count()

The error occurs once df.count() forces the cache to be built: Kryo fails with a registration error for byte[][] (the error screenshot is omitted here).

I don't think adding byte[][] to classesToRegister is a good idea. So what should I do to store a DataFrame in memory with Kryo?
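For reference, the same configuration can also be set programmatically on a SparkConf before the session is created; this is just a sketch of the equivalent setup, not a different approach:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}

// Programmatic equivalent of the --conf flags passed to spark-shell above:
// use Kryo, require registration, and register org.apache.spark.sql.Row.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[Row]))

val spark = SparkSession.builder().config(conf).getOrCreate()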

1 Answer

Datasets don't use standard serialization methods (Java or Kryo serializers). They use specialized columnar storage with its own compression methods, so you don't need to store your Dataset with the Kryo serializer.
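If the goal is just to keep the DataFrame in memory, persisting it with a plain (non-serialized) storage level is enough, and no Kryo registration is involved. A minimal sketch, assuming the same spark-shell session (spark and sc predefined) and the schema from the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel

val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val rows = Seq(Row("hello", 1), Row("world", 2))
val df = spark.createDataFrame(sc.parallelize(rows), schema)

// DataFrame/Dataset caching goes through Spark SQL's compressed columnar
// in-memory store, so spark.serializer never comes into play here.
df.persist(StorageLevel.MEMORY_ONLY)   // or simply df.cache(), which uses MEMORY_AND_DISK
df.count()                             // materializes the cached data

The compression of that columnar cache is controlled by spark.sql.inMemoryColumnarStorage.compressed, which is enabled by default.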
