
I have a PySpark DataFrame with 3 fields, as shown below (Unique Idn, visit time, new visit). I want to create a new field (visit seq, highlighted in yellow) that increments every time a new visit occurs (flagged as 1) for each unique idn.

[screenshot of the sample DataFrame with the desired visit seq column highlighted in yellow]

So every time there is a new visit for a given unique identifier, the counter should increment; it is essentially a visit number. For example, for Unique Idn 11, a visit starts at 1/11 6:24 and ends at 1/11 6:26, and all of its rows (from 1/11 6:24 to 1/11 6:26) are tagged as visit 2. An illustration of the desired output is shown below.
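Since the screenshot is not available, here is a hypothetical illustration of the desired visit seq column. Only the 6:24/6:26 times and the "visit 2" label come from the description above; the other rows and values are made up for illustration:

Unique Idn | visit time | new visit | visit seq
11         | 1/11 6:20  | 1         | 1
11         | 1/11 6:22  | 0         | 1
11         | 1/11 6:24  | 1         | 2
11         | 1/11 6:26  | 0         | 2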

user1403789
  • Does this answer your question? [Python Spark Cumulative Sum by Group Using DataFrame](https://stackoverflow.com/questions/45946349/python-spark-cumulative-sum-by-group-using-dataframe) – Nick ODell May 23 '23 at 18:57
  • To expand on that a little: It seems like you want to partition by `unique idn`, then order by `visit time`, then take a cumulative sum of `new visit`. – Nick ODell May 23 '23 at 18:59
  • So every time there is a new visit per unique identifier, it should increment the counter; it's a kind of visit number. For example, for Unique Idn 11, a visit starts at 1/11 6:24 and ends at 1/11 6:26 and is tagged as visit 2 from 1/11 6:24 to 1/11 6:26. – user1403789 May 23 '23 at 19:43

1 Answer


Ranking in PySpark can be achieved with window functions. A plain sequence number could be obtained with row_number, but here the visit counter is a cumulative sum of the new visit flag over a window partitioned by unique idn and ordered by visit time.

Snippet:

from pyspark.sql import Window as W
import pyspark.sql.functions as F

# Running total of the "new visit" flag per "unique idn", ordered by "visit time"
window_spec = (
    W.partitionBy("unique idn")
    .orderBy(F.col("visit time").asc())
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)

df_visit_seq = df.withColumn("visit_seq", F.sum(F.col("new visit")).over(window_spec))
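For reference, a minimal, self-contained sketch of how this could be verified end to end. The SparkSession setup and sample rows are assumptions added here (they are not part of the original answer) and only mirror the example described in the question:

from pyspark.sql import SparkSession, Window as W
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data: two visits for idn 11, one visit for idn 12
data = [
    (11, "2023-01-11 06:20", 1),  # first visit starts
    (11, "2023-01-11 06:22", 0),
    (11, "2023-01-11 06:24", 1),  # second visit starts
    (11, "2023-01-11 06:26", 0),
    (12, "2023-01-11 06:24", 1),  # counter restarts for a different idn
]
df = spark.createDataFrame(data, ["unique idn", "visit time", "new visit"])

window_spec = (
    W.partitionBy("unique idn")
    .orderBy(F.col("visit time").asc())
    .rowsBetween(W.unboundedPreceding, W.currentRow)
)

df.withColumn("visit_seq", F.sum("new visit").over(window_spec)) \
  .orderBy("unique idn", "visit time") \
  .show()
# unique idn 11 gets visit_seq 1, 1, 2, 2; unique idn 12 gets 1

Because the window is partitioned by unique idn, the cumulative sum restarts for each identifier, which is exactly the per-identifier visit number asked for in the question.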
Jeremy Caney