
I used the first and last functions to get the first and last values of one column, but neither function works the way I expected. I referred to the answer by @zero323, but I am still confused by both. My code looks like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)
]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy('k', 'v')

df.select(F.col("k"), F.last("v",True).over(w).alias('v')).show()

the result:

+---+----+
|  k|   v|
+---+----+
|  b|   1|
|  b|   3|
|  a|null|
|  a|  -1|
|  a|   1|
+---+----+

I supposed it should be like:

+---+----+
|  k|   v|
+---+----+
|  b|   3|
|  b|   3|
|  a|   1|
|  a|   1|
|  a|   1|
+---+----+

because this is what df looks like when ordered by 'k' and 'v':

df.orderBy('k','v').show()

+---+----+
|  k|   v|
+---+----+
|  a|null|
|  a|  -1|
|  a|   1|
|  b|   1|
|  b|   3|
+---+----+

Additionally, I tried another way to test this kind of problem:

df.orderBy('k','v').groupBy('k').agg(F.first('v')).show()

I found that its results can differ each time it is run. Has anyone had the same experience? I would like to use both functions in my project, but these results are inconclusive.
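
Just to illustrate the point, re-running the same aggregation several times (same df as above) can return different rows, since the ordering is not guaranteed to survive the shuffle done by groupBy:

# first() may pick a different row on each run,
# because the sort order is not preserved through the groupBy shuffle
for _ in range(3):
    df.orderBy('k', 'v').groupBy('k').agg(F.first('v')).show()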

Ivan Lee

2 Answers


Try inverting the sort order using .desc(); then first() will give the desired output.

w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select(F.col("k"), F.first("v", True).over(w2).alias('v')).show()

Outputs:

+---+---+
|  k|  v|
+---+---+
|  b|  3|
|  b|  3|
|  a|  1|
|  a|  1|
|  a|  1|
+---+---+

You should also be careful about partitionBy vs. orderBy. Since you are partitioning by 'k', all of the values of k in any given window are the same. Sorting by 'k' does nothing.

The last function is not really the opposite of first in terms of which item from the window it returns. It returns the last non-null value it has seen as it progresses through the ordered rows.

To compare their effects, here is a dataframe with both function/ordering combinations. Notice how in column 'last_w2', the null value has been replaced by -1.

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)]).toDF(["k", "v"])

#create two windows for comparison.
w = Window().partitionBy("k").orderBy('v')
w2 = Window().partitionBy("k").orderBy(df.v.desc())

df.select('k','v',
   F.first("v",True).over(w).alias('first_w1'),
   F.last("v",True).over(w).alias('last_w1'),
   F.first("v",True).over(w2).alias('first_w2'),
   F.last("v",True).over(w2).alias('last_w2')
).show()

Output:

+---+----+--------+-------+--------+-------+
|  k|   v|first_w1|last_w1|first_w2|last_w2|
+---+----+--------+-------+--------+-------+
|  b|   1|       1|      1|       3|      1|
|  b|   3|       1|      3|       3|      3|
|  a|null|    null|   null|       1|     -1|
|  a|  -1|      -1|     -1|       1|     -1|
|  a|   1|      -1|      1|       1|      1|
+---+----+--------+-------+--------+-------+
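
That running behaviour of last comes from the default frame of an ordered window, which covers only the rows from the start of the partition up to the current row. Making that frame explicit reproduces last_w1 (a sketch, using the same df and imports as above):

# spelling out the default frame of an ordered window:
# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
w_explicit = Window().partitionBy("k").orderBy('v') \
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)

df.select('k', 'v', F.last("v", True).over(w_explicit).alias('last_w1')).show()
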
Will
  • I don't understand, why last_w2 is not 1, 1, -1, -1, -1? – Laurynas G Nov 06 '20 at 13:24
  • The spark documentation says ```The function is non-deterministic because its results depends on the order of the rows which may be non-deterministic after a shuffle.``` Does this mean that the first within each partitions is given and this can change depending on the shuffle? https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.first.html?highlight=first – Chogg Feb 25 '22 at 23:31

Have a look at Question 47130030.
The issue is not with the last() function but with the frame, which includes only rows up to the current one.
Using

w = Window().partitionBy("k").orderBy('k', 'v') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

will yield correct results for first() and last().
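
Applied to the data from the question (same df and imports as above), this gives each group its true last non-null value, 1 for a and 3 for b, on every row:

# the frame now covers the whole partition, so last() sees every row
df.select(F.col("k"), F.last("v", True).over(w).alias('v')).show()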

Elke