
I set this configuration: `--conf spark.sql.autoBroadcastJoinThreshold=209715200` (200 MB)

I want to decrease this amount so that it is just a little higher than the size of a specific DataFrame (let's call it bdrDf).
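
For reference, the same threshold can be set programmatically (a minimal sketch, assuming an existing SparkSession named `spark`):

    // same 200 MB threshold as the --conf flag, set at runtime
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 209715200L)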

I tried to estimate the size of bdrDf:

import org.apache.commons.io.FileUtils

val bytes = sparkSession.sessionState.executePlan(bdrDf.queryExecution.logical)
  .optimizedPlan.stats(sparkSession.sessionState.conf).sizeInBytes

// byteCountToDisplaySize already appends the unit (e.g. "58 MB")
println("bdrDf size: " + FileUtils.byteCountToDisplaySize(bytes.toLong))

I got: 58 MB.

Is this the size that Spark will use when it checks whether the DataFrame is below spark.sql.autoBroadcastJoinThreshold?
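
One way to see what Spark actually decides is to inspect the physical plan of the join itself (a sketch; `bigDf` and the join key `"id"` are hypothetical stand-ins for my real join):

    val joined = bigDf.join(bdrDf, Seq("id"))
    joined.explain() // a BroadcastHashJoin node means bdrDf came in under the threshold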

I also saw this metric in the Spark UI:

[Spark UI screenshot: storage size metric]

It corresponds to 492 MB.

Is either of my values correct? If not, how can I estimate the size of my DataFrame?
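
Since my data comes from Hive, one alternative I could try (a sketch; `my_db.my_table` is a placeholder for the real table name) is to compute table statistics so the optimizer has an accurate size to work with:

    // COMPUTE STATISTICS stores sizeInBytes (and row count) in the metastore
    spark.sql("ANALYZE TABLE my_db.my_table COMPUTE STATISTICS")
    val sizeInBytes = spark.table("my_db.my_table")
      .queryExecution.optimizedPlan.stats(spark.sessionState.conf).sizeInBytes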

code:

val Df = readFromHive()
import org.apache.commons.io.FileUtils

def checkSize(df: DataFrame)(implicit spark: SparkSession) = {
  // caching is lazy: the foreach action forces materialization
  df.cache.foreach(el => el)
  val catalyst_plan = df.queryExecution.logical

  // `spark` is the implicit SparkSession parameter of this method
  val df_size_in_bytes = spark.sessionState.executePlan(catalyst_plan)
    .optimizedPlan.stats(spark.sessionState.conf).sizeInBytes

  logger.info("size in MB: " + FileUtils.byteCountToDisplaySize(df_size_in_bytes.toLong))
  logger.info("size in bytes: " + df_size_in_bytes)
}

checkSize(Df)
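
If the estimate stays unreliable, a fallback sketch (again with `bigDf` and the join key `"id"` as hypothetical stand-ins) is to disable the automatic threshold and broadcast only the frame I know is small:

    import org.apache.spark.sql.functions.broadcast

    // -1 disables the automatic size-based broadcast entirely
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)
    val joined = bigDf.join(broadcast(bdrDf), Seq("id")) // explicit broadcast hint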
– Marwan02

1 Answer


I used this function:

  def checkSize(df: DataFrame)(implicit spark: SparkSession) = {
    // caching is lazy, so an action (the no-op foreach) is needed to materialize it
    df.cache.foreach(el => el)
    val catalyst_plan = df.queryExecution.logical
    // read the optimizer's size estimate for the materialized plan
    // (older Spark API; newer versions use stats / stats(conf) instead of statistics)
    val df_size_in_bytes = spark.sessionState.executePlan(
      catalyst_plan).optimizedPlan.statistics.sizeInBytes
    df_size_in_bytes
  }

With this method it is mandatory to cache the df, and because caching is a lazy operation you need to perform the foreach action to trigger it, which is a little weird... Check if that works for you.
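
An equivalent sketch that avoids the foreach idiom, using count() as the materializing action (here I use the `stats(conf)` accessor for Spark 2.2/2.3; on 2.4+ it is `stats` with no argument, on 2.1 and below `statistics`):

    import org.apache.spark.sql.{DataFrame, SparkSession}

    def checkSizeViaCount(df: DataFrame)(implicit spark: SparkSession): BigInt = {
      df.cache().count() // any action materializes the cache
      spark.sessionState.executePlan(df.queryExecution.logical)
        .optimizedPlan.stats(spark.sessionState.conf).sizeInBytes
    }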

– Emiliano Martinez
  • I used `.stats(sparkSession.sessionState.conf)` instead of `statistics`; maybe it's because I'm on a version higher than 2.2. So this code will give me the amount that Spark compares with `spark.sql.autoBroadcastJoinThreshold`? – Marwan02 Nov 16 '21 at 13:02
  • Maybe, it is a little bit old. I had this function in a project running on 2.2. This code gives the number of bytes of the DataFrame, so you can see if it exceeds the threshold for the broadcast process in join operations. – Emiliano Martinez Nov 16 '21 at 13:17
  • I get size in MB: 303 MB, size in bytes: 317808272. It's higher than my spark.sql.autoBroadcastJoinThreshold (200 MB) and my df is still broadcast ... – Marwan02 Nov 16 '21 at 13:27
  • Can you add the code and an image of the driver console with the stages? – Emiliano Martinez Nov 16 '21 at 13:29
  • ```val Df = readFromHive(); import org.apache.commons.io.FileUtils; def checkSize(df: DataFrame)(implicit spark: SparkSession) = { df.cache.foreach(el => el); val catalyst_plan = df.queryExecution.logical; val df_size_in_bytes = spark.sessionState.executePlan(catalyst_plan).optimizedPlan.stats(spark.sessionState.conf).sizeInBytes; logger.info("size in MB: " + FileUtils.byteCountToDisplaySize(df_size_in_bytes.toLong)); logger.info("size in bytes: " + df_size_in_bytes) }; checkSize(Df)``` I'm not sure what you want in the image? – Marwan02 Nov 16 '21 at 13:35
  • Besides the call, what Spark operation is executing the broadcast... a join? – Emiliano Martinez Nov 16 '21 at 14:52
  • Yes, Spark is doing a join. – Marwan02 Nov 21 '21 at 23:00