I have a dataset which looks like this:
+---+-------------------------------+--------+
|key|value |someData|
+---+-------------------------------+--------+
|1 |AAA |5 |
|1 |VVV |6 |
|1 |DDDD |8 |
|3 |rrerw |9 |
|4 |RRRRR |13 |
|6 |AAAAABB |15 |
|6 |C:\Windows\System32\svchost.exe|20 |
+---+-------------------------------+--------+
Now I apply the aggregate function avg twice: first over an ordered window, then over an unordered window. The results are not the same. For example:
WindowSpec windowSpec = Window.orderBy(col("someData")).partitionBy(col("key"));
rawMapping.withColumn("avg", avg("someData").over(windowSpec)).show(false);
+---+-------------------------------+--------+-----------------+
|key|value |someData|avg |
+---+-------------------------------+--------+-----------------+
|1 |AAA |5 |5.0 |
|1 |VVV |6 |5.5 |
|1 |DDDD |8 |6.333333333333333|
|6 |AAAAABB |15 |15.0 |
|6 |C:\Windows\System32\svchost.exe|20 |17.5 |
|3 |rrerw |9 |9.0 |
|4 |RRRRR |13 |13.0 |
+---+-------------------------------+--------+-----------------+
WindowSpec windowSpec2 = Window.partitionBy(col("key"));
rawMapping.withColumn("avg", avg("someData").over(windowSpec2)).show(false);
+---+-------------------------------+--------+-----------------+
|key|value |someData|avg |
+---+-------------------------------+--------+-----------------+
|1 |AAA |5 |6.333333333333333|
|1 |VVV |6 |6.333333333333333|
|1 |DDDD |8 |6.333333333333333|
|6 |AAAAABB |15 |17.5 |
|6 |C:\Windows\System32\svchost.exe|20 |17.5 |
|3 |rrerw |9 |9.0 |
|4 |RRRRR |13 |13.0 |
+---+-------------------------------+--------+-----------------+
When the window is ordered, the aggregate function shows a "sliding window" behavior: each row's avg only covers the rows up to and including that row, instead of the whole partition. Why is this happening? And, more importantly, is it a bug or a feature?
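To make the difference concrete, here is a minimal plain-Java sketch (no Spark involved) of the two behaviors as I understand them from the outputs above. The class name WindowAvgSketch and the methods runningAvg and partitionAvg are hypothetical names made up for this illustration, not Spark APIs. The sketch assumes the ordered window averages from the first row of the partition up to the current row, and the unordered window averages over the whole partition.

```java
import java.util.Arrays;

public class WindowAvgSketch {

    // Average of values[0..i] for each i: the frame grows from the
    // start of the partition up to the current row.
    static double[] runningAvg(double[] values) {
        double[] out = new double[values.length];
        double sum = 0.0;
        for (int i = 0; i < values.length; i++) {
            sum += values[i];
            out[i] = sum / (i + 1);
        }
        return out;
    }

    // Average of the whole partition, repeated for every row.
    static double[] partitionAvg(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double[] out = new double[values.length];
        Arrays.fill(out, values.length == 0 ? 0.0 : sum / values.length);
        return out;
    }

    public static void main(String[] args) {
        // someData for key 1, already sorted as in the ordered window
        double[] key1 = {5, 6, 8};
        System.out.println(Arrays.toString(runningAvg(key1)));
        System.out.println(Arrays.toString(partitionAvg(key1)));
    }
}
```

For key 1, runningAvg should reproduce the avg column of the first output and partitionAvg the avg column of the second. If that reading is right, the ordered window is using a frame that ends at the current row rather than one that spans the whole partition.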