I am working on a sequence of transformations in PySpark (version 3.3.1). At a certain point I have a dropDuplicates(subset=[X]) followed by an exceptAll, and I get an error. Here is a reproducible pipeline:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# df1 has two rows with the same (n, m) pair but different raw values
df1 = spark.createDataFrame(
    [(1, 'a', 'AAA'), (1, 'a', 'XXX'), (2, 'b', 'BBB')],
    ['n', 'm', 'raw']
)

# df2 holds the rows I expect to remain in df1 after deduplicating on (n, m)
df2 = spark.createDataFrame(
    [(1, 'a', 'AAA'), (2, 'b', 'BBB')],
    ['n', 'm', 'raw']
)

df1.show()
df2.show()
+---+---+---+
| n| m|raw|
+---+---+---+
| 1| a|AAA|
| 1| a|XXX|
| 2| b|BBB|
+---+---+---+
+---+---+---+
| n| m|raw|
+---+---+---+
| 1| a|AAA|
| 2| b|BBB|
+---+---+---+
We drop the duplicates from df1 using a subset of the columns:
df1 = df1.dropDuplicates(subset=['n', 'm'])
df1.show()
And we get this, as expected:
+---+---+---+
| n| m|raw|
+---+---+---+
| 1| a|AAA|
| 2| b|BBB|
+---+---+---+
Then we attempt an exceptAll between the two DataFrames:
df1.exceptAll(df2).show()
And the error arises:
23/07/27 17:19:33 ERROR Executor: Exception in task 0.0 in stage 128.0 (TID 230)
java.lang.IllegalStateException: Couldn't find raw#931 in [n#929L,m#930,raw#1007,sum#1005L]
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589)
at scala.collection.immutable.List.map(List.scala:297)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:528)
at org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:73)
at org.apache.spark.sql.execution.GenerateExec.boundGenerator$lzycompute(GenerateExec.scala:75)
at org.apache.spark.sql.execution.GenerateExec.boundGenerator(GenerateExec.scala:75)
at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$6(GenerateExec.scala:101)
at org.apache.spark.sql.execution.LazyIterator.results$lzycompute(GenerateExec.scala:36)
at org.apache.spark.sql.execution.LazyIterator.results(GenerateExec.scala:36)
at org.apache.spark.sql.execution.LazyIterator.hasNext(GenerateExec.scala:37)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$ConcatIterator.advance(Iterator.scala:199)
at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:227)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:136)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
I expected the exceptAll to run correctly, in this case returning an empty DataFrame, since df1 (after the dropDuplicates) is identical to df2.
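At that point both DataFrames really do hold the same rows. A quick sanity check, hedged because dropDuplicates is free to keep either row of the (1, 'a') group (in the run above it kept 'AAA'):

# Collect both DataFrames on the driver and compare their rows directly.
# With the output shown above (dropDuplicates kept the 'AAA' row),
# both sides contain the same two rows.
assert sorted(df1.collect()) == sorted(df2.collect())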
I have noticed that the error says the raw column could not be found (Couldn't find raw#931 in [n#929L,m#930,raw#1007,sum#1005L]), i.e., it looked for raw#931 but only raw#1007 is available (I believe those numbers are the internal IDs Spark uses to identify column references, right?).
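Those IDs can be inspected without triggering the job by printing the query plans; a minimal sketch using the standard explain method (nothing specific to my pipeline):

# Print the parsed, analyzed, optimized and physical plans for the failing query;
# the attribute references printed here (e.g. raw#...) use the same kind of
# expression IDs that appear in the exception message.
df1.exceptAll(df2).explain(extended=True)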
Furthermore, if we remove the subset parameter from dropDuplicates, the error is not raised.
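For reference, this is the variant without subset (rebuilding the original three-row DataFrame under a hypothetical name df1_full, since df1 was reassigned above). It runs without the exception, although the result is of course no longer empty, because the (1, 'a', 'XXX') row is not dropped:

# Rebuild the original DataFrame; all three rows are distinct when every column is considered
df1_full = spark.createDataFrame(
    [(1, 'a', 'AAA'), (1, 'a', 'XXX'), (2, 'b', 'BBB')],
    ['n', 'm', 'raw']
)

# Without subset, dropDuplicates keeps all three rows and the exceptAll does not raise;
# the result should contain only the (1, 'a', 'XXX') row
df1_full.dropDuplicates().exceptAll(df2).show()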
Am I doing something wrong? Is there a way to remove duplicates (using only a subset of columns) and then use the resulting DataFrame in an exceptAll comparison?
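In case it helps frame an answer, here is the kind of alternative I would also accept: deduplicating on the subset with a window function instead of dropDuplicates. A minimal sketch, reusing the df1_full DataFrame from above (the _rn helper column and the ordering on raw are just illustrative, and I have not verified whether this avoids the same exception):

from pyspark.sql import Window
import pyspark.sql.functions as F

# Keep one row per (n, m); ordering by raw only makes the kept row deterministic
w = Window.partitionBy('n', 'm').orderBy('raw')

df1_dedup = (
    df1_full.withColumn('_rn', F.row_number().over(w))
            .filter(F.col('_rn') == 1)
            .drop('_rn')
)

df1_dedup.exceptAll(df2).show()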