I'm trying to understand the Spark cache manager's behavior. I deployed my test code to Spark Job Server so that I have a long-running context, and I want to test caching by executing the same job several times in a row.

val manager = spark.sharedState.cacheManager
val df = collectData.retrieveDataFromCass(spark) // loaded from Cassandra successfully, 2k rows
// 0 = no entry in the cache manager for this plan, 1 = entry found
val testCachedData = if (manager.lookupCachedData(df.queryExecution.logical).isEmpty) 0 else 1
df.createOrReplaceTempView(tempName1) // tempName1: view name defined elsewhere
spark.sqlContext.cacheTable(tempName1)
df.count() // action, materializes the cache
testCachedData

The job then returns testCachedData as its result.

I expected testCachedData to be 0 on the first job execution and 1 on the following ones. Instead, every execution returns 0, as if the cache manager were empty each time. Yet when I check the Storage tab in the Spark UI, I can see that cached data is there.
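For comparison, here is a minimal, self-contained sketch of the same check against a plain in-memory source. The local[*] session, the spark.range source standing in for the Cassandra load, and the view name cachedView are all assumptions for illustration:

import org.apache.spark.sql.SparkSession

object CacheLookupSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the long-running job server context.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-lookup-sketch")
      .getOrCreate()

    val manager = spark.sharedState.cacheManager

    // Two iterations simulate two consecutive job executions in the same session.
    for (run <- 1 to 2) {
      val df = spark.range(0, 2000) // stands in for the Cassandra load
      val found = manager.lookupCachedData(df.queryExecution.logical).isDefined
      println(s"run $run: cached entry found before caching = $found")

      df.createOrReplaceTempView("cachedView") // hypothetical stand-in for tempName1
      spark.sqlContext.cacheTable("cachedView")
      df.count() // action, materializes the cache
    }

    spark.stop()
  }
}

With a source like this, the second run's lookup should succeed, since lookupCachedData matches plans by sameResult, which ignores cosmetic differences such as expression IDs. That would suggest the miss in the real job comes from how the Cassandra DataFrame's logical plan compares across executions, not from lookupCachedData itself.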

Why can't the cache manager see my cached data within the same Spark application?

This test uses Spark 3.2 and spark-cassandra-connector 3.0.1.

  • Isn't the cache per job? I don't think Spark maintains a cache that can be shared across jobs, does it? – Gaël J Jun 13 '22 at 18:39
  • @GaëlJ It seems each job execution builds a new logical plan for the DataFrame, with different IDs assigned to each column, so the next job can't tell that the data already exists and caches it again (see the sketch after these comments). Do you think that's related? And yes, the cache should be shared: consecutive job executions on the job server get noticeably faster, so there must be a shared cache somewhere. – Rand Abu Salim Jun 14 '22 at 11:56
  • Really not sure, but the cache might exist at a lower level than Spark, maybe HDFS or the OS. – Gaël J Jun 14 '22 at 18:05
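
One way to test the expression-ID theory from the comments is the sketch below. It reuses the question's own retrieveDataFromCass helper and is meant to run inside the same long-lived session; everything else is an assumption for illustration:

val manager = spark.sharedState.cacheManager

// Load the same data twice, simulating two consecutive job executions
// against the same session.
val df1 = collectData.retrieveDataFromCass(spark)
val df2 = collectData.retrieveDataFromCass(spark)

// lookupCachedData matches plans by result equivalence (sameResult), which
// ignores cosmetic differences such as expression IDs. If the two loads do
// not compare as the same result, a re-loaded DataFrame can never hit the
// entry cached by a previous job, even though the Storage tab still shows it.
println(s"sameResult = ${df1.queryExecution.logical.sameResult(df2.queryExecution.logical)}")

df1.cache().count() // cache and materialize the first load
println(s"hit for df2 = ${manager.lookupCachedData(df2.queryExecution.logical).isDefined}")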

0 Answers