
I used the first and last functions to get the first and last values of one column, but neither function works the way I expected. I referred to the answer by @zero323, but I am still confused by both. My code looks like this:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)
]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy('k', 'v')

df.select(F.col("k"), F.last("v",True).over(w).alias('v')).show()

the result:

+---+----+
|  k|   v|
+---+----+
|  b|   1|
|  b|   3|
|  a|null|
|  a|  -1|
|  a|   1|
+---+----+

I supposed it should be like:

+---+----+
|  k|   v|
+---+----+
|  b|   3|
|  b|   3|
|  a|   1|
|  a|   1|
|  a|   1|
+---+----+

because this is what df looks like when ordered by 'k' and 'v':

df.orderBy('k','v').show()

+---+----+
|  k|   v|
+---+----+
|  a|null|
|  a|  -1|
|  a|   1|
|  b|   1|
|  b|   3|
+---+----+

Additionally, I tried another way to test this kind of problem:

df.orderBy('k','v').groupBy('k').agg(F.first('v')).show()

I found that its results can differ each time it is run. Has anyone had the same experience? I would like to use both functions in my project, but these results are inconclusive.
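
Just to illustrate the point, re-running the same aggregation several times (same df as above) can return different rows, since the ordering is not guaranteed to survive the shuffle done by groupBy:

# first() may pick a different row on each run,
# because the sort order is not preserved through the groupBy shuffle
for _ in range(3):
    df.orderBy('k', 'v').groupBy('k').agg(F.first('v')).show()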

Ivan Lee

2 Answers


Try inverting the sort order using .desc(); then first() will give the desired output.

w2 = Window().partitionBy("k").orderBy(df.v.desc())
df.select(F.col("k"), F.first("v", True).over(w2).alias('v')).show()

Outputs:

+---+---+
|  k|  v|
+---+---+
|  b|  3|
|  b|  3|
|  a|  1|
|  a|  1|
|  a|  1|
+---+---+

You should also be careful about partitionBy vs. orderBy. Since you are partitioning by 'k', all of the values of k in any given window are the same. Sorting by 'k' does nothing.

The last function is not really the opposite of first in terms of which item from the window it returns. It returns the last non-null value it has seen as it progresses through the ordered rows.

To compare their effects, here is a dataframe with both function/ordering combinations. Notice how in column 'last_w2', the null value has been replaced by -1.

df = spark.sparkContext.parallelize([
    ("a", None), ("a", 1), ("a", -1), ("b", 3), ("b", 1)]).toDF(["k", "v"])

#create two windows for comparison.
w = Window().partitionBy("k").orderBy('v')
w2 = Window().partitionBy("k").orderBy(df.v.desc())

df.select('k','v',
   F.first("v",True).over(w).alias('first_w1'),
   F.last("v",True).over(w).alias('last_w1'),
   F.first("v",True).over(w2).alias('first_w2'),
   F.last("v",True).over(w2).alias('last_w2')
).show()

Output:

+---+----+--------+-------+--------+-------+
|  k|   v|first_w1|last_w1|first_w2|last_w2|
+---+----+--------+-------+--------+-------+
|  b|   1|       1|      1|       3|      1|
|  b|   3|       1|      3|       3|      3|
|  a|null|    null|   null|       1|     -1|
|  a|  -1|      -1|     -1|       1|     -1|
|  a|   1|      -1|      1|       1|      1|
+---+----+--------+-------+--------+-------+
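
That running behaviour of last comes from the default frame of an ordered window, which covers only the rows from the start of the partition up to the current row. Making that frame explicit reproduces last_w1 (a sketch, using the same df and imports as above):

# spelling out the default frame of an ordered window:
# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
w_explicit = Window().partitionBy("k").orderBy('v') \
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)

df.select('k', 'v', F.last("v", True).over(w_explicit).alias('last_w1')).show()
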
Will
  • I don't understand, why last_w2 is not 1, 1, -1, -1, -1? – Laurynas G Nov 06 '20 at 13:24
  • The spark documentation says ```The function is non-deterministic because its results depends on the order of the rows which may be non-deterministic after a shuffle.``` Does this mean that the first within each partitions is given and this can change depending on the shuffle? https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.first.html?highlight=first – Chogg Feb 25 '22 at 23:31

Have a look at Question 47130030.
The issue is not with the last() function but with the frame, which includes only rows up to the current one.
Using

w = Window().partitionBy("k").orderBy('k', 'v') \
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

will yield correct results for first() and last().
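
Applied to the data from the question (same df and imports as above), this gives each group its true last non-null value, 1 for a and 3 for b, on every row:

# the frame now covers the whole partition, so last() sees every row
df.select(F.col("k"), F.last("v", True).over(w).alias('v')).show()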

Elke