1

I am trying to understand how a frame is constructed when using a window function and seeing some unexpected behavior. This is the code (can be run as is):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, FloatType, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()


schema = StructType([StructField('product', StringType(), True),
                     StructField('category', StringType(), True),
                     StructField('revenue', IntegerType(), True)])

spark.createDataFrame(

    [
        ("Thin", "Cell Phone", 6000),
        ("Normal", "Tablet", 1500),
        ("Mini", "Tablet", 5500),
        ("Ultra thin", "Cell Phone", 5000),
        ("Very thin", "Cell Phone", 6000),
        ("Big", "Tablet", 2500),
        ("Bendable", "Cell Phone", 3000),
        ("Foldable", "Cell Phone", 3000),
        ("Pro", "Tablet", 4500),
        ("Pro2", "Tablet", 6500),
     ] ,schema=schema ).createOrReplaceTempView("revenues")

# using collect list to print the contents of the current frame
# using min to show the unexpected behavior
spark.sql("""
    select product,
           category,
           revenue,
           collect_list((product, revenue)) over (partition by category order by revenue desc) frame,
           min(revenue) over (partition by category order by revenue desc) as min
    from revenues
""").show(truncate=False)

And this is the output:

+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|product   |category  |revenue|frame                                                                                    |min |
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|Pro2      |Tablet    |6500   |[[Pro2, 6500]]                                                                           |6500|
|Mini      |Tablet    |5500   |[[Pro2, 6500], [Mini, 5500]]                                                             |5500|
|Pro       |Tablet    |4500   |[[Pro2, 6500], [Mini, 5500], [Pro, 4500]]                                                |4500|
|Big       |Tablet    |2500   |[[Pro2, 6500], [Mini, 5500], [Pro, 4500], [Big, 2500]]                                   |2500|
|Normal    |Tablet    |1500   |[[Pro2, 6500], [Mini, 5500], [Pro, 4500], [Big, 2500], [Normal, 1500]]                   |1500|
|Thin      |Cell Phone|6000   |[[Thin, 6000], [Very thin, 6000]]                                                        |6000|
|Very thin |Cell Phone|6000   |[[Thin, 6000], [Very thin, 6000]]                                                        |6000|
|Ultra thin|Cell Phone|5000   |[[Thin, 6000], [Very thin, 6000], [Ultra thin, 5000]]                                    |5000|
|Bendable  |Cell Phone|3000   |[[Thin, 6000], [Very thin, 6000], [Ultra thin, 5000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Foldable  |Cell Phone|3000   |[[Thin, 6000], [Very thin, 6000], [Ultra thin, 5000], [Bendable, 3000], [Foldable, 3000]]|3000|
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+

Now the question:

In this case, the window function does not specify any restrictions on the frame. I would expect the frame to be similar in all rows (and correspond to the full group of values). In practice, I see that the frame expands (not sure according to what criteria). This causes an incorrect value to be returned by the 'min' functions- 6500 for 'Tablet' category instead of 1500.

If I remove the order by clause in the window spec, I see the expected behavior:

collect_list((product, revenue)) over (partition by category) frame, 
min(revenue) over (partition by category) as min

Yields:

+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|product   |category  |revenue|frame                                                                                    |min |
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+
|Normal    |Tablet    |1500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Mini      |Tablet    |5500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Big       |Tablet    |2500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Pro       |Tablet    |4500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Pro2      |Tablet    |6500   |[[Normal, 1500], [Mini, 5500], [Big, 2500], [Pro, 4500], [Pro2, 6500]]                   |1500|
|Thin      |Cell Phone|6000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Ultra thin|Cell Phone|5000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Very thin |Cell Phone|6000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Bendable  |Cell Phone|3000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
|Foldable  |Cell Phone|3000   |[[Thin, 6000], [Ultra thin, 5000], [Very thin, 6000], [Bendable, 3000], [Foldable, 3000]]|3000|
+----------+----------+-------+-----------------------------------------------------------------------------------------+----+

Are my assumptions incorrect? If so - What are the mechanics of frame construction in both cases?

Tested on Spark 3.0.1

Vitaliy
  • 8,044
  • 7
  • 38
  • 66

1 Answers1

1

This is the expected behaviour: the documentation states:

When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding, unboundedFollowing) is used by default. When ordering is defined, a growing window frame (rangeFrame, unboundedPreceding, currentRow) is used by default.

werner
  • 13,518
  • 6
  • 30
  • 45