In PySpark, when using `count().over(window)`, the results are incorrect if the window definition includes an `orderBy`. I'm not sure whether this is a bug, or whether there is a better way to do it.

Compare the same group under two different window definitions, one with `orderBy` and one without: they produce different results. The window definition without `orderBy` gives the expected result.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

cschema = StructType([
    StructField('customer', StringType()),
    StructField('sales', IntegerType())])
data = [
    ['Bob', 20],
    ['Bob', 30],
    ['Bob', 22],
    ['John', 33],
    ['John', 18],
    ['Bob', 30],
    ['John', 18]]
test_df = spark.createDataFrame(data, schema=cschema)
test_df.show()

+--------+-----+
|customer|sales|
+--------+-----+
|     Bob|   20|
|     Bob|   30|
|     Bob|   22|
|    John|   33|
|    John|   18|
|     Bob|   30|
|    John|   18|
+--------+-----+

win_ordered = Window.partitionBy('customer').orderBy(col('sales'))
win_non_ordered = Window.partitionBy('customer')
test_df.withColumn('cnt1', count(col('sales')).over(win_ordered)) \
       .withColumn('cnt2', count(col('sales')).over(win_non_ordered)) \
       .show()

+--------+-----+----+----+
|customer|sales|cnt1|cnt2|
+--------+-----+----+----+
|     Bob|   20|   1|   4|
|     Bob|   22|   2|   4|
|     Bob|   30|   4|   4|
|     Bob|   30|   4|   4|
|    John|   18|   2|   3|
|    John|   18|   2|   3|
|    John|   33|   3|   3|
+--------+-----+----+----+

I am expecting the 'cnt1' column to have the same value across each group, just like the 'cnt2' column.

Gary Liu
  • it is not a bug ... it is how window functions work ... please see this post https://stackoverflow.com/questions/50945477/count-rows-in-partition-with-order-by – thePurplePython Jul 18 '19 at 19:21
  • Possible duplicate of [Count rows in partition with Order By](https://stackoverflow.com/questions/50945477/count-rows-in-partition-with-order-by) – pault Jul 18 '19 at 19:40

0 Answers