
I am working on a sequence of transformations in PySpark (version 3.3.1).

At a certain point I have a dropDuplicates(subset=[X]) followed by an exceptAll, and I get an error. Here is a reproducible pipeline:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame(
    [(1, 'a', 'AAA'), (1, 'a', 'XXX'), (2, 'b', 'BBB')],
    ['n', 'm', 'raw']
)

df2 = spark.createDataFrame(
    [(1, 'a', 'AAA'), (2, 'b', 'BBB')],
    ['n', 'm', 'raw']
)

df1.show()
df2.show()
+---+---+---+
|  n|  m|raw|
+---+---+---+
|  1|  a|AAA|
|  1|  a|XXX|
|  2|  b|BBB|
+---+---+---+

+---+---+---+
|  n|  m|raw|
+---+---+---+
|  1|  a|AAA|
|  2|  b|BBB|
+---+---+---+

We drop duplicates from df1 using a subset of the columns:

df1 = df1.dropDuplicates(subset=['n', 'm'])
df1.show()

And we get this, as expected:

+---+---+---+
|  n|  m|raw|
+---+---+---+
|  1|  a|AAA|
|  2|  b|BBB|
+---+---+---+

Then we attempt an exceptAll between the DataFrames:

df1.exceptAll(df2).show()

And the error arises:

23/07/27 17:19:33 ERROR Executor: Exception in task 0.0 in stage 128.0 (TID 230)
java.lang.IllegalStateException: Couldn't find raw#931 in [n#929L,m#930,raw#1007,sum#1005L]
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
    at scala.collection.immutable.List.map(List.scala:297)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
    at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
    at org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
    at org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
    at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$6(GenerateExec.scala:101)
    at org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
    at org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
    at org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
    at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

I expected the exceptAll to run correctly, in this case returning an empty DataFrame, since df1 (after the dropDuplicates) is identical to df2.

I have noticed that the error says the raw column could not be found (Couldn't find raw#931 in [n#929L,m#930,raw#1007,sum#1005L]), i.e., it looked for raw#931 but only raw#1007 is available. (I believe these numbers are the internal references Spark uses to identify the columns, right?)
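
To look at those IDs, we can print the extended query plan; the #NNN suffixes shown in the analyzed/optimized plans are Spark's internal expression IDs for the column attributes. Since explain only prints the plans without executing the query, it does not trigger the error:

# Print the parsed/analyzed/optimized/physical plans without executing the query;
# each column attribute appears with its internal expression ID (e.g. raw#931).
df1.exceptAll(df2).explain(extended=True)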

Furthermore, if we remove the subset parameter from dropDuplicates, the error is not raised.
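
For comparison, a minimal sketch of the no-subset variant (df1_all is just an illustrative name). Note that on this toy data dropDuplicates() without a subset removes nothing, since the rows differ in raw, so the result is not empty; the point is only that the error does not occur:

# Rebuild df1 (it was overwritten above) and deduplicate over all columns
# instead of just ['n', 'm'].
df1_all = spark.createDataFrame(
    [(1, 'a', 'AAA'), (1, 'a', 'XXX'), (2, 'b', 'BBB')],
    ['n', 'm', 'raw']
).dropDuplicates()

# Runs without the IllegalStateException; here it returns the row (1, a, XXX).
df1_all.exceptAll(df2).show()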

Am I doing something wrong? Is there a way to remove duplicates (using only a subset of columns) and then use the resulting DataFrame in an exceptAll comparison?

PMHM
    I just tried adding ```df1.cache()``` just before the ```df1.exceptAll(df2).show()``` and it worked but don't know why. it would be great if someone explains it. – Niveditha S Jul 28 '23 at 22:51
  • Hi @NivedithaS! Thanks for the comment... I have tried adding `df1.cache()` and it indeed worked! So it seems like a viable workaround. In my real-world scenario I am not sure whether caching the DataFrame at that spot of the pipeline will be possible, but it is valuable information nonetheless! I wonder if this is something that should be reported to the Spark team, or if there is something wrong with my usage... – PMHM Jul 31 '23 at 23:46
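
For reference, a minimal sketch of the workaround described in the comments above (caching df1 right before the exceptAll). It reportedly avoids the error, though whether caching is feasible at that point of a real pipeline is a separate question:

# Start again from the original (un-deduplicated) df1.
df1 = spark.createDataFrame(
    [(1, 'a', 'AAA'), (1, 'a', 'XXX'), (2, 'b', 'BBB')],
    ['n', 'm', 'raw']
)

df1 = df1.dropDuplicates(subset=['n', 'm'])
df1.cache()                    # workaround from the comments
df1.exceptAll(df2).show()      # reportedly returns the expected empty DataFrame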

0 Answers