
I'm running a PySpark job, and I'm getting the following message:

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

What does the message indicate, and how do I define a partition for a Window operation?

EDIT:

I'm trying to rank on an entire column.

My data is organized as:

A
B
A
C
D

And I want:

A,1
B,3
A,1
C,4
D,5

I don't think there should be a .partitionBy() for this, only .orderBy(). The trouble is, this appears to cause serious performance degradation. Is there another way to achieve this without a Window function?

If I partition by the first column, the result would be:

A,1
B,1
A,1
C,1
D,1

Which I do not want.
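For reference, the window-based version I'm describing looks roughly like this (a minimal sketch; the column name value is just illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("A",), ("B",), ("A",), ("C",), ("D",)], ["value"])

# No .partitionBy() here, which is what triggers the warning:
# the whole table is shuffled into one partition before ordering.
ranked = df.withColumn("rank", F.rank().over(Window.orderBy("value")))
ranked.show()  # A -> 1, A -> 1, B -> 3, C -> 4, D -> 5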

  • If one of the answers provided solves your problem, please accept it so we can close this question! – eliasah Apr 22 '16 at 11:36
  • Sorry, none of the answers have provided a solution yet. – cshin9 Apr 22 '16 at 13:13
  • @cshin9 Well, actually the existing answer addresses your question exactly. There is no special magic that can make a window function without partitioning efficient. – zero323 Apr 24 '16 at 08:31

1 Answer


Given the information in the question, at best I can provide a skeleton showing how partitions should be defined for Window functions:

from pyspark.sql.window import Window

windowSpec = (
    Window
    .partitionBy(...)  # this is where you define partitioning
    .orderBy(...)
)

This is equivalent to the following SQL:

OVER (PARTITION BY ... ORDER BY ...)

So, concerning the partitioning specification:

It controls which rows end up in the same partition as a given row. You want to make sure all rows having the same value for the partition column are collected to the same machine before ordering and calculating the frame.

If you don't give any partitioning specification, then all data must be collected to a single machine, hence the following warning message:

WARN org.apache.spark.sql.execution.Window: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
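To make the skeleton concrete, here is a minimal sketch of a partitioned window; the column names category and value are purely illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Illustrative data: `category` is the partitioning key, `value` is ordered within it.
df = spark.createDataFrame(
    [("x", 3), ("x", 1), ("y", 2), ("y", 2)],
    ["category", "value"],
)

# Rows sharing a `category` stay together, so no single-partition shuffle is needed.
windowSpec = Window.partitionBy("category").orderBy("value")
df.withColumn("rank", F.rank().over(windowSpec)).show()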
  • What if I want to order by the entire table and not use .partitionBy()? Is there a more efficient way to do it? (i.e. RANK() OVER (ORDER BY ...)) – cshin9 Apr 07 '16 at 14:07
  • The only efficient way is to partitionBy! – eliasah Apr 07 '16 at 14:24
  • What should I partition by if I'm ranking on the whole table? Partitioning implies that I want a ranking for each partition separately. – cshin9 Apr 07 '16 at 15:18
  • I can't answer this question without a context. You'll have to update your question with at least the DataFrame schema that you are trying to perform the Window function on. – eliasah Apr 07 '16 at 15:20
  • I've updated with an example of what I'm trying to do. – cshin9 Apr 07 '16 at 15:22
  • What you are doing looks like a basic order by; I still don't understand the usage of a window function for that purpose. – eliasah Apr 07 '16 at 16:32
  • I need to make a second column assigning the rank to each element in the first column. – cshin9 Apr 07 '16 at 17:37
  • You can still zipWithIndex after ordering (see the sketch after these comments). – eliasah Apr 07 '16 at 17:38
  • Can you please post the solution you followed to solve this issue? @eliasah I can't seem to see the discussion that happened or a solution provided here. – CodeReaper Mar 20 '18 at 13:30
  • I have provided the solution in my answer @CodeReaper – eliasah Mar 20 '18 at 14:19
  • I don't see anything related to zipWithIndex in your answer @eliasah. I didn't get your idea of zipWithIndex after ordering. – Galuoises Jul 27 '20 at 09:01
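A minimal sketch of the zipWithIndex idea mentioned in the comments above — not necessarily the exact solution eliasah had in mind. It assumes a single-column DataFrame matching the question's data, and it emulates RANK semantics (ties share a rank) by keeping the minimum index per value:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrame with the question's data.
df = spark.createDataFrame([("A",), ("B",), ("A",), ("C",), ("D",)], ["value"])

# Sort the values, then attach a 0-based position with zipWithIndex.
# sortBy shuffles across many partitions, avoiding the single-partition window.
indexed = df.rdd.map(lambda row: row.value).sortBy(lambda v: v).zipWithIndex()

# RANK semantics: tied values share the smallest position, so keep the
# minimum index per value and shift to a 1-based rank.
ranks = indexed.reduceByKey(min).mapValues(lambda i: i + 1)

# Join the ranks back onto the original values.
result = df.rdd.map(lambda row: (row.value, None)) \
              .join(ranks) \
              .map(lambda pair: (pair[0], pair[1][1]))

print(sorted(result.collect()))
# [('A', 1), ('A', 1), ('B', 3), ('C', 4), ('D', 5)]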