
I have two questions about Spark serialization that I have not been able to find answers to by searching.

  1. How can I print the name of the serializer currently in use? I want to know whether spark.serializer is Java or Kryo.
  2. I have the following code, which is supposed to use Kryo serialization. The memory used for the cached DataFrame is 21 MB, about a quarter of what I got when caching without serialization; but when I remove the Kryo configuration, the size stays at the same 21 MB. Does this mean Kryo was never used in the first place? Could it be that, because the records in the DataFrame are simply Rows, Java and Kryo serialization produce the same size?

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    conf.set("spark.kryo.registrationRequired", "false")
    val spark = SparkSession.builder.master("local[*]").config(conf)
      .appName("KryoWithRegistrationNOTRequired").getOrCreate
    val df = spark.read.csv("09-MajesticMillion.csv")
    df.persist(StorageLevel.MEMORY_ONLY_SER)
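Regarding the first question, a minimal sketch (assuming the `spark` session above is already built) that reads the configured value back out of the SparkConf; note Spark falls back to Java serialization when the key is unset:

```scala
// Print the configured serializer. If "spark.serializer" was never set,
// Spark defaults to org.apache.spark.serializer.JavaSerializer, so that
// is supplied here as the fallback value.
val serializerName = spark.sparkContext.getConf
  .get("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
println(serializerName)
```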
    
user1888243

1 Answer


Does this mean Kryo was never used in the first place?

That is exactly what it means. Spark SQL (Dataset) uses its own columnar storage format for caching, so neither Java nor Kryo serialization is applied, and spark.serializer has no impact at all.
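To sketch the distinction (reusing the `df` from the question): spark.serializer does apply when you cache at the RDD level with a serialized storage level, while DataFrame/Dataset caching goes through the internal columnar format regardless of that setting:

```scala
import org.apache.spark.storage.StorageLevel

// DataFrame caching: stored in Spark SQL's internal columnar format;
// spark.serializer is not consulted, so Java vs. Kryo makes no difference.
df.persist(StorageLevel.MEMORY_ONLY_SER)

// RDD caching with a serialized storage level: each record is serialized
// with the configured spark.serializer (Java by default, Kryo if set),
// so here the choice of serializer does affect the cached size.
df.rdd.persist(StorageLevel.MEMORY_ONLY_SER)
```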

  • Thanks for your answer; can you refer me to a source or documentation that explains this? – user1888243 Dec 26 '17 at 19:43
  • @user9142754 Then what is the use of adding the configuration conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")? Where is it used? – BdEngineer Jul 17 '20 at 08:19