In PySpark, when using `count().over(window)`, the results are incorrect if the window definition includes an `orderBy`. I'm not sure whether this is a bug, or whether there is a better way to do it.

Compare the same group under two different window definitions, one with `orderBy` and one without: they produce different results. The window definition without `orderBy` gives the expected result.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

cschema = StructType([
    StructField('customer', StringType()),
    StructField('sales', IntegerType())])
data = [
    ['Bob', 20],
    ['Bob', 30],
    ['Bob', 22],
    ['John', 33],
    ['John', 18],
    ['Bob', 30],
    ['John', 18]]
test_df = spark.createDataFrame(data, schema=cschema)
test_df.show()

+--------+-----+
|customer|sales|
+--------+-----+
|     Bob|   20|
|     Bob|   30|
|     Bob|   22|
|    John|   33|
|    John|   18|
|     Bob|   30|
|    John|   18|
+--------+-----+

win_ordered = Window.partitionBy('customer').orderBy(col('sales'))
win_non_ordered = Window.partitionBy('customer')
test_df.withColumn('cnt1', count(col('sales')).over(win_ordered)) \
       .withColumn('cnt2', count(col('sales')).over(win_non_ordered)) \
       .show()

+--------+-----+----+----+
|customer|sales|cnt1|cnt2|
+--------+-----+----+----+
|     Bob|   20|   1|   4|
|     Bob|   22|   2|   4|
|     Bob|   30|   4|   4|
|     Bob|   30|   4|   4|
|    John|   18|   2|   3|
|    John|   18|   2|   3|
|    John|   33|   3|   3|
+--------+-----+----+----+

I am expecting the 'cnt1' column to have the same value across each group, just like the 'cnt2' column.

Gary Liu
  • it is not a bug ... it is how window functions work ... please see this post https://stackoverflow.com/questions/50945477/count-rows-in-partition-with-order-by – thePurplePython Jul 18 '19 at 19:21
  • Possible duplicate of [Count rows in partition with Order By](https://stackoverflow.com/questions/50945477/count-rows-in-partition-with-order-by) – pault Jul 18 '19 at 19:40

0 Answers