I am trying to use Spark with the Kryo serializer to store some data at a lower memory cost. Now I have run into a problem: I cannot cache a DataFrame (whose type is Dataset[Row]) in memory with the Kryo serializer. I thought all I needed to do was add org.apache.spark.sql.Row to spark.kryo.classesToRegister, but the error still occurs:

spark-shell \
  --conf spark.kryo.classesToRegister=org.apache.spark.sql.Row \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrationRequired=true
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.storage.StorageLevel

val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val seq = Seq(("hello", 1), ("world", 2))
val rdd = sc.parallelize(seq).map { case (name, id) => Row(name, id) }
val df = spark.createDataFrame(rdd, schema).persist(StorageLevel.MEMORY_ONLY_SER)
df.count()

The error occurs once df.count() forces the cache to be built: Kryo fails with a registration error for byte[][] (the error screenshot is omitted here).

I don't think adding byte[][] to classesToRegister is a good idea. So what should I do to store a DataFrame in memory with Kryo?
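For reference, the same configuration can also be set programmatically on a SparkConf before the session is created; this is just a sketch of the equivalent setup, not a different approach:

import org.apache.spark.SparkConf
import org.apache.spark.sql.{Row, SparkSession}

// Programmatic equivalent of the --conf flags passed to spark-shell above:
// use Kryo, require registration, and register org.apache.spark.sql.Row.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(classOf[Row]))

val spark = SparkSession.builder().config(conf).getOrCreate()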

1 Answer

Datasets don't use standard serialization methods (Java or Kryo serializers). They use specialized columnar storage with its own compression methods, so you don't need to store your Dataset with the Kryo serializer.
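If the goal is just to keep the DataFrame in memory, persisting it with a plain (non-serialized) storage level is enough, and no Kryo registration is involved. A minimal sketch, assuming the same spark-shell session (spark and sc predefined) and the schema from the question:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import org.apache.spark.storage.StorageLevel

val schema = StructType(StructField("name", StringType, true) :: StructField("id", IntegerType, false) :: Nil)
val rows = Seq(Row("hello", 1), Row("world", 2))
val df = spark.createDataFrame(sc.parallelize(rows), schema)

// DataFrame/Dataset caching goes through Spark SQL's compressed columnar
// in-memory store, so spark.serializer never comes into play here.
df.persist(StorageLevel.MEMORY_ONLY)   // or simply df.cache(), which uses MEMORY_AND_DISK
df.count()                             // materializes the cached data

The compression of that columnar cache is controlled by spark.sql.inMemoryColumnarStorage.compressed, which is enabled by default.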
