
Since upgrading our job to pyspark 3.3.0, we have had issues with cached ps.DataFrames that are then concatenated using pyspark pandas: ps.concat([df1, df2])

The issue is that the concatenated DataFrame does not use the cached data but re-reads the source data, which in our case causes an authentication failure against the source.

This was not the behavior we had with pyspark 3.2.3.

This minimal example reproduces the issue.

import pyspark.pandas as ps
import pyspark
from pyspark.sql import SparkSession

import sys
import os
os.environ["PYSPARK_PYTHON"] = sys.executable

spark = SparkSession.builder.appName('bug-pyspark3.3').getOrCreate()

df1 = ps.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, columns=['col1', 'col2'])
df2 = ps.DataFrame(data={'col3': [5, 6]}, columns=['col3'])
cached_df1 = df1.spark.cache()
cached_df2 = df2.spark.cache()

# Materialize the caches before concatenating
cached_df1.count()
cached_df2.count()

merged_df = ps.concat([cached_df1, cached_df2], ignore_index=True)
merged_df.head()
merged_df.spark.explain()

Output of explain() on pyspark 3.2.3:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [(cast(_we0#1300 as bigint) - 1) AS __index_level_0__#1298L, col1#1291L, col2#1292L, col3#1293L]
   +- Window [row_number() windowspecdefinition(_w0#1299L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _we0#1300], [_w0#1299L ASC NULLS FIRST]
      +- Sort [_w0#1299L ASC NULLS FIRST], false, 0
         +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=356]
            +- Project [col1#1291L, col2#1292L, col3#1293L, monotonically_increasing_id() AS _w0#1299L]
               +- Union
                  :- Project [col1#941L AS col1#1291L, col2#942L AS col2#1292L, null AS col3#1293L]
                  :  +- InMemoryTableScan [col1#941L, col2#942L]
                  :        +- InMemoryRelation [__index_level_0__#940L, col1#941L, col2#942L, __natural_order__#946L], StorageLevel(disk, memory, deserialized, 1 replicas)
                  :              +- *(1) Project [__index_level_0__#940L, col1#941L, col2#942L, monotonically_increasing_id() AS __natural_order__#946L]
                  :                 +- *(1) Scan ExistingRDD[__index_level_0__#940L,col1#941L,col2#942L]
                  +- Project [null AS col1#1403L, null AS col2#1404L, col3#952L]
                     +- InMemoryTableScan [col3#952L]
                           +- InMemoryRelation [__index_level_0__#951L, col3#952L, __natural_order__#955L], StorageLevel(disk, memory, deserialized, 1 replicas)
                                 +- *(1) Project [__index_level_0__#951L, col3#952L, monotonically_increasing_id() AS __natural_order__#955L]
                                    +- *(1) Scan ExistingRDD[__index_level_0__#951L,col3#952L]

We can see that the cache is used in the planned execution (InMemoryTableScan).


Output of explain() on pyspark 3.3.0:

== Physical Plan ==
AttachDistributedSequence[__index_level_0__#771L, col1#762L, col2#763L, col3#764L] Index: __index_level_0__#771L
+- Union
   :- *(1) Project [col1#412L AS col1#762L, col2#413L AS col2#763L, null AS col3#764L]
   :  +- *(1) Scan ExistingRDD[__index_level_0__#411L,col1#412L,col2#413L]
   +- *(2) Project [null AS col1#804L, null AS col2#805L, col3#423L]
      +- *(2) Scan ExistingRDD[__index_level_0__#422L,col3#423L]

We can see that on this version of pyspark the Union is fed by a plain Scan ExistingRDD of the source data instead of an InMemoryTableScan.


Is this difference expected? Is there any way to "force" the concat to use the cached DataFrames?


1 Answer

I cannot explain the difference in the planned execution output between pyspark 3.2.3 and 3.3.0, but I believe that despite this difference the cache is still being used. I ran some benchmarks with and without caching, using an example very similar to yours, and the average time for the merge operation is shorter when the DataFrames are cached.

import time

import numpy as np
import pyspark.pandas as ps

# `spark` is assumed to be the SparkSession created in the question
def test_merge_without_cache(n=5, size=10**5):
    np.random.seed(44)
    total_run_times = []
    
    for i in range(n):
        data = np.random.rand(size,2)
        data2 = np.random.rand(size,2)

        df1 = ps.DataFrame(data, columns=['col1','col2'])
        df2 = ps.DataFrame(data2, columns=['col3','col4'])
        
        start_time = time.time()
        merged_df = ps.concat([df1,df2], ignore_index=True)
        run_time = time.time() - start_time
        total_run_times.append(run_time)
        spark.catalog.clearCache()
        
    return total_run_times

def test_merge_with_cache(n=5, size=10**5):
    np.random.seed(44)
    total_run_times = []
    
    for i in range(n):
        data = np.random.rand(size,2)
        data2 = np.random.rand(size,2)

        df1 = ps.DataFrame(data, columns=['col1','col2'])
        df2 = ps.DataFrame(data2, columns=['col3','col4'])
        
        cached_df1 = df1.spark.cache()
        cached_df2 = df2.spark.cache()

        start_time = time.time()
        merged_df = ps.concat([cached_df1,cached_df2], ignore_index=True)
        run_time = time.time() - start_time
        total_run_times.append(run_time)
        spark.catalog.clearCache()
        
    return total_run_times

Here are the printouts from when I ran these two test functions:

total_run_times_without_cache = test_merge_without_cache(n=50, size=10**6)
np.mean(total_run_times_without_cache)
0.12456250190734863

total_run_times_with_cache = test_merge_with_cache(n=50, size=10**6)
np.mean(total_run_times_with_cache)
0.07876112937927246

This isn't a huge difference in speed, so it's possible this is just noise and the cache is, in fact, not being used (but I did run this benchmark several times and the merge with cache was consistently faster). Someone with a better understanding of pyspark might be able to better explain what you're observing, but hopefully this answer helps a bit.
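A more direct check than timing (just a sketch; `plan_uses_cache` is only a helper name I'm making up here, not an existing API) would be to capture the output of `.spark.explain()`, like you did in the question, and look for an InMemoryTableScan node programmatically:

import io
from contextlib import redirect_stdout

def plan_uses_cache(psdf):
    """Return True if the physical plan of a pandas-on-Spark DataFrame
    contains an InMemoryTableScan node, i.e. it reads from the cache
    instead of re-scanning the source."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        psdf.spark.explain()  # prints the physical plan to stdout
    return "InMemoryTableScan" in buf.getvalue()

# e.g. with the DataFrames from the question:
# merged_df = ps.concat([cached_df1, cached_df2], ignore_index=True)
# plan_uses_cache(merged_df)  # True on 3.2.3, apparently False on 3.3.0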

Here is a plot of the execution time between merge with and without cache:

import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(y=total_run_times_without_cache, name='without cache'))
fig.add_trace(go.Scatter(y=total_run_times_with_cache, name='with cache'))
fig.show()

[Plot: per-run concat execution times, with and without cache]

  • Interesting, what version did you use to run these benchmarks? However, I think this is just noise, as the original issue was with DataFrames read from a parquet file with auth tokens that expired after the parquet read. When the concat was performed and the resulting DataFrame was read, we got an authentication error because pyspark was re-reading the parquet. – frco9 Feb 13 '23 at 07:16
  • I used `pyspark 3.3.0` for these benchmarks. Perhaps I would be able to better reproduce your behavior if I read DataFrames in from an external parquet file – Derek O Feb 13 '23 at 15:14