I'm running a PySpark job, and I'm getting the following message:
WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
What does the message indicate, and how do I define a partition for a Window operation?
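If it helps frame the question, my understanding is that a partition is normally defined with Window.partitionBy, roughly like this (a minimal sketch; the column names are made up):

from pyspark.sql.window import Window

# With a partition key, the window work is spread across partitions
w_partitioned = Window.partitionBy("group").orderBy("value")

# Without one, Spark pulls all rows into a single partition,
# which is what the WARN above is complaining about
w_global = Window.orderBy("value")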
EDIT:
I'm trying to compute a rank over an entire column.
My data is organized as:
A
B
A
C
D
And I want:
A,1
B,3
A,1
C,4
D,5
I don't think there should be a .partitionBy() for this, only an .orderBy(). The trouble is that this appears to cause the performance degradation the warning describes. Is there another way to achieve this without a Window function?
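For concreteness, what I'm running is roughly this (a minimal sketch; the real column name differs):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Toy version of my data; the real column name differs
df = spark.createDataFrame([("A",), ("B",), ("A",), ("C",), ("D",)], ["value"])

# orderBy() only, no partitionBy() -- this is what emits the WARN above
w = Window.orderBy("value")
df.withColumn("rank", F.rank().over(w)).show()
# gives A -> 1, B -> 3, C -> 4, D -> 5 (ties share a rank), as desired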
If I partition by the first column, the result would be:
A,1
B,1
A,1
C,1
D,1
Which I do not want.
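That is, something like this (continuing from the snippet above, same made-up column name):

# Partitioning by the ranked column itself puts each distinct value
# in its own one-value window, so every row gets rank 1
w_bad = Window.partitionBy("value").orderBy("value")
df.withColumn("rank", F.rank().over(w_bad)).show()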