Since we upgraded our job to PySpark 3.3.0, we have issues with cached pandas-on-Spark DataFrames (ps.DataFrame) that are then concatenated with pyspark.pandas: ps.concat([df1, df2]). The issue is that the concatenated DataFrame does not use the cached data but re-reads the source data, which in our case causes an authentication issue at the source. This was not the behavior we had with PySpark 3.2.3.
The following minimal code reproduces the issue:
import os
import sys

import pyspark.pandas as ps
from pyspark.sql import SparkSession

# Make the workers use the same Python interpreter as the driver.
os.environ["PYSPARK_PYTHON"] = sys.executable

spark = SparkSession.builder.appName('bug-pyspark3.3').getOrCreate()

df1 = ps.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]}, columns=['col1', 'col2'])
df2 = ps.DataFrame(data={'col3': [5, 6]}, columns=['col3'])

# Cache both DataFrames and materialize the cache with a count().
cached_df1 = df1.spark.cache()
cached_df2 = df2.spark.cache()
cached_df1.count()
cached_df2.count()

# Concatenate the two cached DataFrames and inspect the physical plan.
merged_df = ps.concat([cached_df1, cached_df2], ignore_index=True)
merged_df.head()
merged_df.spark.explain()
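As a sanity check (this snippet is an addition to the minimal repro, using the same spark accessor as above), the inputs can be confirmed to be cached right before the concat:

# Sanity check added to the repro: confirm both inputs are actually cached
# before calling ps.concat().
print(cached_df1.spark.storage_level)
print(cached_df2.spark.storage_level)
cached_df1.spark.explain()  # expected to show InMemoryRelation / InMemoryTableScan
cached_df2.spark.explain()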
Output of explain() on PySpark 3.2.3:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [(cast(_we0#1300 as bigint) - 1) AS __index_level_0__#1298L, col1#1291L, col2#1292L, col3#1293L]
+- Window [row_number() windowspecdefinition(_w0#1299L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS _we0#1300], [_w0#1299L ASC NULLS FIRST]
+- Sort [_w0#1299L ASC NULLS FIRST], false, 0
+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [plan_id=356]
+- Project [col1#1291L, col2#1292L, col3#1293L, monotonically_increasing_id() AS _w0#1299L]
+- Union
:- Project [col1#941L AS col1#1291L, col2#942L AS col2#1292L, null AS col3#1293L]
: +- InMemoryTableScan [col1#941L, col2#942L]
: +- InMemoryRelation [__index_level_0__#940L, col1#941L, col2#942L, __natural_order__#946L], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- *(1) Project [__index_level_0__#940L, col1#941L, col2#942L, monotonically_increasing_id() AS __natural_order__#946L]
: +- *(1) Scan ExistingRDD[__index_level_0__#940L,col1#941L,col2#942L]
+- Project [null AS col1#1403L, null AS col2#1404L, col3#952L]
+- InMemoryTableScan [col3#952L]
+- InMemoryRelation [__index_level_0__#951L, col3#952L, __natural_order__#955L], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Project [__index_level_0__#951L, col3#952L, monotonically_increasing_id() AS __natural_order__#955L]
+- *(1) Scan ExistingRDD[__index_level_0__#951L,col3#952L]
We can see that the cache is used in the physical plan (InMemoryTableScan).
Output of explain() on PySpark 3.3.0:
== Physical Plan ==
AttachDistributedSequence[__index_level_0__#771L, col1#762L, col2#763L, col3#764L] Index: __index_level_0__#771L
+- Union
:- *(1) Project [col1#412L AS col1#762L, col2#413L AS col2#763L, null AS col3#764L]
: +- *(1) Scan ExistingRDD[__index_level_0__#411L,col1#412L,col2#413L]
+- *(2) Project [null AS col1#804L, null AS col2#805L, col3#423L]
+- *(2) Scan ExistingRDD[__index_level_0__#422L,col3#423L]
We can see that on this version of PySpark, the Union is performed by a plain scan of the source data (Scan ExistingRDD) instead of an InMemoryTableScan.
Is this difference expected? Is there any way to "force" the concat to use the cached DataFrames?
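For context, a workaround we are experimenting with (a sketch only, not verified; the index handling is our own assumption) is to do the union on the underlying Spark DataFrames, where the cache appears to be honoured, and convert back to pandas-on-Spark afterwards:

# Workaround sketch (our assumption, not verified): union the cached Spark
# DataFrames directly and wrap the result back into pandas-on-Spark.
sdf1 = cached_df1.to_spark()
sdf2 = cached_df2.to_spark()
merged_sdf = sdf1.unionByName(sdf2, allowMissingColumns=True)  # missing columns become null
merged_df = merged_sdf.pandas_api()  # note: the original pandas-on-Spark index is not preserved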